Speed up float 5x5 matrix * vector multiplication with SSE - c++

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
Do I need the Intel compiler to do it?
Can you point out some references?

The Eigen C++ template library for vectors, matrices, ... has both
optimised code for small fixed size matrices (as well as dynamically sized ones)
optimised code that uses SSE optimisations
so you should give it a try.

In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M. Defining the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors.
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need read in the rows of U but in memory U is stored as xyzwtxyzwtxyzwtxyzwt so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format the matrix product is very efficient.
Instead of taking O(5x5x4) operations with scalar code it only takes O(5x5) operations i.e. a 4x speedup. With AVX the matrix U will be 5x8 so instead of taking O(5x5x8) operations it only taxes O(5x5), i.e. a 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speed up with SSE and a 8x with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction level parallelism (ILP) you can get another factor of 2 in speedup so the speedup for SSE could 32x with four cores (64x AVX) and again another factor of 2 with Haswell due to FMA3.

I would suggest using Intel IPP and abstract yourself of dependency on techniques

If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.

This should be easy, especially when you're on Core 2 or later: You neeed 5* _mm_dp_ps , one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, You can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about 2,4 megabytes of vectors, when memory bandwidths are in single-digit gigabytes per second.

What is known about the vector? Since the matrix is fixed, AND if there is a limited amount of values that the vector can take, then I'd suggest that you pre-compute the calculations and access them using a table look-up.
The classic optimization technique to trade memory for cycles...

I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think alot about cache friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time?? So rather than repeatedly doing an y = A*x style operation can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]. If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM. Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
Hope this helps.

If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but it should be able to keep up without too much problem.
5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better, I've heard legends about how it's much better with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.

Related

Intrinsics: using __128 registers

I am playing with SIMD and thinking to use for Vector operations in 3D math.
Instead having
class Vec4f
{
float val[4];
//+operators here
}
I could have
class SimdVec4f
{
__m128 val; //+operators
}
But since there are just 8 available registers for __m128, what will happend if I want to have more than 8 instances of this class? Does the compiler handles the loading from memory to registers and vica versa on its own as for usual variables?
Thanks for your time and for giving me some insight into this.
It's exactly the same as when you have more int variables than there are integer registers: the compiler may have to spill them to memory if too many are live at the same time, and reload them later. Register allocation for vector registers is done pretty much the same way as register allocation for integer regs, analysing the data flow of a function and figuring out which variables are alive at the same time.
You should think of _mm_load_ps/loadu and store/storeu intrinsics as more describing the type-punning to/from vector types, not as being the only thing that can compile to a vector load or store instruction, or always compiling to a load/store.
And BTW, x86-64 has xmm0..15. Compile for 64-bit if you want code that needs several registers to be efficient.
SSE for 3D vectors:
Generally avoid keeping a single direction/geometry vector in a SIMD vector. You can add efficiently, but any cross- or dot-products or length calculations will require shuffling.
It's better if you can use a vector of 4 x values, a vector of 4 y values, etc., so you can compute 4 lengths in parallel. See https://stackoverflow.com/tags/sse/info for more, especially these slides:
SIMD at Insomniac Games (GDC 2015) which show how to lay out your data for efficient SIMD. (Struct of arrays, not array of structs).
See also Parallel programming using Haswell architecture
Sometimes you can get a minor benefit for a single vector in cases where you can't reorganize to compute lots of things in parallel. _mm_setr_ps() can be slow if the source data isn't contiguous, though.
There are already several C++ wrapper libraries for SIMD, such as Agner Fog's GPL Apache-licensed VectorClass, and some others.

iteration direction on an array

Say we have two arrays a and b of a fundamental type (say, a float) and we need to calculate a[i] + b[i] for every valid index i, as well as store the result. What is the best way to iterate over the arrays to maximize cache hits? Is it front-to-back, back-to-front or something else?
For this kind of operation you should use the auto-vectorization of your compiler. Iterate small i to large i. Also, the answer depends on what you mean by "store the result" and the number n of items items you are going to iterate over.
If you mean c[i] = a[i] + b[i] and n is not too small then your compiler's auto-vectorizer will optimize this best without any more changes. Even MSVC will get that one correct (at least for SSE). Your compiler will have to do some adjustments for n not a multiple of 4 (or 8 for AVX) and alignment but this cost will be amortized across n and this overhead will have a negligible effect except for small n. If n is small then you might want to consider alignment. How small is small has to be determined but I would guess it's much less than 100.
If you mean sum + = a[i] + b[i], a reduction, then you do need to think about this. This has a dependency chain so you need to unroll your loop 3-10 times. Additionally, you need to use a relaxed floating point model since floating point arithmetic is not associative and the auto-vectorization won't kick in without it so add -ffast-math to GCC (/fp:fast to MSVC). If you unroll the loop and use a a relaxed floating point model then GCC, ICC, Clang, and MSVC should auto-vectorize your reduction efficiently.
In order to utilize the cache pre-fetch capability you need to read the arrays from front to back sequentially.
Furthermore, the arrays should be SSE aligned (16 byte). Even more important is that the items (e.g. floats) will be aligned on their size (4 bytes for floats). This is important so data will not cross cache lines (slower read).
After the arrays are aligned, you can use SSE/AVX to read, add and store the results doing 4 or 8 operations in a single instruction.
Edit:
You can read more on cache prefetching here and in depth description in the Intel SW Developer Manual.

How to optimize this product of three matrices in C++ for x86?

I have a key algorithm in which most of its runtime is spent on calculating a dense matrix product:
A*A'*Y, where: A is an m-by-n matrix,
A' is its conjugate transpose,
Y is an m-by-k matrix
Typical characteristics:
- k is much smaller than both m or n (k is typically < 10)
- m in the range [500, 2000]
- n in the range [100, 1000]
Based on these dimensions, according to the lessons of the matrix chain multiplication problem, it's clear that it's optimal in a number-of-operations sense to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from just forcing that associativity to the expression is noticeable.
My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library as a backend. Eigen is able to use IMKL's BLAS interface to perform the multiplication, and the boost from moving to Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.
However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between. I'm wondering if, by going to a specialized implementation for evaluating the entire expression at once, I can arrive at an implementation that is faster than the generic one that I have now. A couple observations that lead me to believe this are:
Using the typical dimensions described above, the input matrix A usually won't fit in cache. Therefore, the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.
A and its conjugate transpose obviously have a very related structure that could possibly be exploited to improve the memory access pattern for the overall expression.
Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references out there for breaking this structure down in the most memory-friendly manner possible.
Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a simple expression just isn't supported (only products of 2 general dense matrices are).
At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with a bent toward avoiding unneeded redundant memory accesses due to the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would likely just implement it in a specialized routine with SIMD intrinsics.
This is not a full answer (yet - and I'm not sure it will become one).
Let's think of the math first a little. Since matrix multiplication is associative we can either do
(A*A')Y or A(A'*Y).
Floating point operations for (A*A')*Y
2*m*n*m + 2*m*m*k //the twos come from addition and multiplication
Floating point operations for A*(A'*Y)
2*m*n*k + 2*m*n*k = 4*m*n*k
Since k is much smaller than m and n it's clear why the second case is much faster.
But by symmetry we could in principle reduce the number of calculations for A*A' by two (though this might not be easy to do with SIMD) so we could reduce the number of floating point operations of (A*A')*Y to
m*n*m + 2*m*m*k.
We know that both m and n are larger than k. Let's choose a new variable for m and n called z and find out where case one and two are equal:
z*z*z + 2*z*z*k = 4*z*z*k //now simplify
z = 2*k.
So as long as m and n are both more than twice k the second case will have less floating point operations. In your case m and n are both more than 100 and k less than 10 so case two uses far fewer floating point operations.
In terms of efficient code. If the code is optimized for efficient use of the cache (as MKL and Eigen are) then large dense matrix multiplication is computation bound and not memory bound so you don't have to worry about the cache. MKL is faster than Eigen since MKL uses AVX (and maybe fma3 now?).
I don't think you will be able to do this more efficiently than you're already doing using the second case and MKL (through Eigen). Enable OpenMP to get maximum FLOPS.
You should calculate the efficiency by comparing FLOPS to the peak FLOPS of you processor. Assuming you have a Sandy Bridge/Ivy Bridge processor. The peak SP FLOPS is
frequency * number of physical cores * 8 (8-wide AVX SP) * 2 (addition + multiplication)
For double precession divide by two. If you have Haswell and MKL uses FMA then double the peak FLOPS. To get the frequency right you have to use the turbo boost values for all cores (it's lower than for a single core). You can look this up if you have not overclocked your system or use CPU-Z on Windows or Powertop on Linux if you have an overclocked system.
Use a temporary matrix to compute A'*Y, but make sure you tell eigen that there's no aliasing going on: temp.noalias() = A.adjoint()*Y. Then compute your result, once again telling eigen that objects aren't aliased: result.noalias() = A*temp.
There would be redundant computation only if you would perform (A*A')*Y since in this case (A*A') is symmetric and only half of the computation are required. However, as you noticed it is still much faster to perform A*(A'*Y) in which case there is no redundant computations. I confirm that the cost of the temporary creation is completely negligible here.
I guess that perform the following
result = A * (A.adjoint() * Y)
will be the same as do that
temp = A.adjoint() * Y
result = A * temp;
If your matrix Y fits in the cache, you can probably take advantage of doing it like that
result = A * (Y.adjoint() * A).adjoint()
or, if the previous notation is not allowed, like that
temp = Y.adjoint() * A
result = A * temp.adjoint();
Then you dont need to do the adjoint of matrix A, and store the temporary adjoint matrix for A, that will be much expensive that the one for Y.
If your matrix Y fits in the cache, it should be much faster doing a loop runing over the colums of A for the first multiplication, and then over the rows of A for the second multimplication (having Y.adjoint() in the cache for the first multiplication and temp.adjoint() for the second), but I guess that internally eigen is already taking care of that things.

How much effort do you have to put in to get gains from using SSE?

Case One
Say you have a little class:
class Point3D
{
private:
float x,y,z;
public:
operator+=()
...etc
};
Point3D &Point3D::operator+=(Point3D &other)
{
this->x += other.x;
this->y += other.y;
this->z += other.z;
}
A naive use of SSE would simply replace these function bodies with using a few intrinsics. But would we expect this to make much difference? MMX used to involve costly state cahnges IIRC, does SSE or are they just like other instructions? And even if there's no direct "use SSE" overhead, would moving the values into SSE registers and back out again really make it any faster?
Case Two
Instead, you're working with a less OO-based code base. Rather than an array/vector of Point3D objects, you simply have a big array of floats:
float coordinateData[NUM_POINTS*3];
void add(int i,int j) //yes it's unsafe, no overlap check... example only
{
for (int x=0;x<3;++x)
{
coordinateData[i*3+x] += coordinateData[j*3+x];
}
}
What about use of SSE here? Any better?
In conclusion
Is trying to optimise single vector operations using SSE actually worthwhile, or is it really only valuable when doing bulk operations?
In general you will need to take additional steps to get the best out of SSE (or any other SIMD architecture):
data needs to be 16 byte aligned (ideally)
data needs to be contiguous
you need enough data to make the SIMD operation worthwhile
you need to coalesce as many operations as you can to mitigate the costs of loads/stores
you need to be aware of the cache/memory hierarchy and its performance impact (e.g. use strip-mining/tiling)
This Gamasutra article shows what it takes to make fast SSE-based code. It covers your "Case 1" in detail.
The source code is available from the author's homepage.
Also Slides + text: SIMD at Insomniac Games (GDC 2015) discuss why using a SIMD vector to hold a single x,y,z,(padding) geometry vector is not efficient. (Because you'll need horizontal shuffles and a scalar sqrt to do things like the length of a vector, sqrt(sum of squares), compared to doing 4 lengths in parallel from 3 vectors of x0,x1,x2,x3, y0-3, z0-3.)
See also other links in the SSE tag wiki.
it is valuable if your is case is that you do a lot of same calculations on range of data. for example you calculate square roots of many-many equations. you can load 4 values in sse registers and call operations once. this will increase performance by 4.
and there are libraries that have all sse optimization inside them. don't reinvent bicycle.
I tried Case One at work a couple of years ago and the performance gain was barely measurable. In the end I decided to skip it since all the hassle with aligning all Point3D on 16 byte boundaries made it not worthwhile.
As you've correctly guessed SSE is most suited to bulk operations where they can give a pretty good speed up. Before you go ahead and use the SSE intrinsics check what code the compiler is already generating. I know from experience that for instance Visual Studio is pretty good at using SSE-optimizations.

How does BLAS get such extreme performance?

Out of curiosity I decided to benchmark my own matrix multiplication function versus the BLAS implementation... I was to say the least surprised at the result:
Custom Implementation, 10 trials of
1000x1000 matrix multiplication:
Took: 15.76542 seconds.
BLAS Implementation, 10 trials of
1000x1000 matrix multiplication:
Took: 1.32432 seconds.
This is using single precision floating point numbers.
My Implementation:
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
if ( ADim2!=BDim1 )
throw std::runtime_error("Error sizes off");
memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
int cc2,cc1,cr1;
for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
C[cc2*ADim2+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
}
I have two questions:
Given that a matrix-matrix multiplication say: nxm * mxn requires n*n*m multiplications, so in the case above 1000^3 or 1e9 operations. How is it possible on my 2.6Ghz processor for BLAS to do 10*1e9 operations in 1.32 seconds? Even if multiplcations were a single operation and there was nothing else being done, it should take ~4 seconds.
Why is my implementation so much slower?
A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-OrtĂ­. They provide a free download version.
BLAS is divided into three levels:
Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SSE).
Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level1 functions. However, you can boost the performance of this functions if you can provide a dedicated implementation that makes use of some multiprocessor architecture with shared memory.
Level 3 functions are operations like the matrix-matrix product. Again you could implement them in terms of Level2 functions. But Level3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy then you can boost performance if you provide a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations.
By the way, most (or even all) of the high performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C and its performance critical parts in Assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface such that it can be linked against LAPACK (LAPACK gains all its performance from BLAS).
Optimized compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).
IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are:
Maybe its not possible to provide a cache optimized implementation of these algorithms (i.e. you would loose more then you would win)
These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK this is a no-go.
Although these algorithms have a nice time complexity on paper, the Big O notation hides a large constant, so it only starts to become viable for extremely large matrices.
Edit/Update:
The new and ground breaking paper for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually I implemented several variants of the matrix-matrix product. The simplest variants is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize the loops
for (l=0; l<MR*NR; ++l) {
AB[l] = 0;
}
for (l=0; l<kc; ++l) {
for (j=0; j<NR; ++j) {
for (i=0; i<MR; ++i) {
AB[i+j*MR] += A[i]*B[j];
}
}
A += MR;
B += NR;
}
The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:
ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)
Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such a performance. And why it does not matter whether you use row or column major storage!
The final benchmarks are here (we called our project ulmBLAS):
Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen
Another Edit/Update:
I also wrote some tutorial on how BLAS gets used for numerical linear algebra problems like solving a system of linear equations:
High Performance LU Factorization
(This LU factorization is for example used by Matlab for solving a system of linear equations.)
I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.
Ok, here you go: Coding a Cache Optimized Parallel LU Factorization
P.S.: I also did make some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:
Experiments on uBLAS.
Here a similar project with BLAZE:
Experiments on BLAZE.
So first of all BLAS is just an interface of about 50 functions. There are many competing implementations of the interface.
Firstly I will mention things that are largely unrelated:
Fortran vs C, makes no difference
Advanced matrix algorithms such as Strassen, implementations dont use them as they dont help in practice
Most implementations break each operation into small-dimension matrix or vector operations in the more or less obvious way. For example a large 1000x1000 matrix multiplication may broken into a sequence of 50x50 matrix multiplications.
These fixed-size small-dimension operations (called kernels) are hardcoded in CPU-specific assembly code using several CPU features of their target:
SIMD-style instructions
Instruction Level Parallelism
Cache-awareness
Furthermore these kernels can be executed in parallel with respect to each other using multiple threads (CPU cores), in the typical map-reduce design pattern.
Take a look at ATLAS which is the most commonly used open source BLAS implementation. It has many different competing kernels, and during the ATLAS library build process it runs a competition among them (some are even parameterized, so the same kernel can have different settings). It tries different configurations and then selects the best for the particular target system.
(Tip: That is why if you are using ATLAS you are better off building and tuning the library by hand for your particular machine then using a prebuilt one.)
First, there are more efficient algorithms for matrix multiplication than the one you're using.
Second, your CPU can do much more than one instruction at a time.
Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (of course this figure isn't accurate either, as the CPU can typically only process one SIMD instruction per cycle)
Third, your code is far from optimal:
You're using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can specify to tell the compiler that they don't alias. Alternatively, you should use other types than raw pointers, which take care of the problem.
You're thrashing the cache by performing a naive traversal of each row/column of the input matrices. You can use blocking to perform as much work as possible on a smaller block of the matrix, which fits in the CPU cache, before moving on to the next block.
For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of coaxing to get up to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it's not trivial, and it doesn't just happen.
I don't know specfically about BLAS implementation but there are more efficient alogorithms for Matrix Multiplication that has better than O(n3) complexity. A well know one is Strassen Algorithm
Most arguments to the second question -- assembler, splitting into blocks etc. (but not the less than N^3 algorithms, they are really overdeveloped) -- play a role. But the low velocity of your algorithm is caused essentially by matrix size and the unfortunate arrangement of the three nested loops. Your matrices are so large that they do not fit at once in cache memory. You can rearrange the loops such that as much as possible will be done on a row in cache, this way dramatically reducing cache refreshes (BTW splitting into small blocks has an analogue effect, best if loops over the blocks are arranged similarly). A model implementation for square matrices follows. On my computer its time consumption was about 1:10 compared to the standard implementation (as yours).
In other words: never program a matrix multiplication along the "row times column" scheme that we learned in school.
After having rearranged the loops, more improvements are obtained by unrolling loops, assembler code etc.
void vector(int m, double ** a, double ** b, double ** c) {
int i, j, k;
for (i=0; i<m; i++) {
double * ci = c[i];
for (k=0; k<m; k++) ci[k] = 0.;
for (j=0; j<m; j++) {
double aij = a[i][j];
double * bj = b[j];
for (k=0; k<m; k++) ci[k] += aij*bj[k];
}
}
}
One more remark: This implementation is even better on my computer than replacing all by the BLAS routine cblas_dgemm (try it on your computer!). But much faster (1:4) is calling dgemm_ of the Fortran library directly. I think this routine is in fact not Fortran but assembler code (I do not know what is in the library, I don't have the sources). Totally unclear to me is why cblas_dgemm is not as fast since to my knowledge it is merely a wrapper for dgemm_.
This is a realistic speed up. For an example of what can be done with SIMD assembler over C++ code, see some example iPhone matrix functions - these were over 8x faster than the C version, and aren't even "optimized" assembly - there's no pipe-lining yet and there is unnecessary stack operations.
Also your code is not "restrict correct" - how does the compiler know that when it modifies C, it isn't modifying A and B?
With respect to the original code in MM multiply, memory reference for most operation is the main cause of bad performance. Memory is running at 100-1000 times slower than cache.
Most of speed up comes from employing loop optimization techniques for this triple loop function in MM multiply. Two main loop optimization techniques are used; unrolling and blocking. With respect to unrolling, we unroll the outer two most loops and block it for data reuse in cache. Outer loop unrolling helps optimize data-access temporally by reducing the number of memory references to same data at different times during the entire operation. Blocking the loop index at specific number, helps with retaining the data in cache. You can choose to optimize for L2 cache or L3 cache.
https://en.wikipedia.org/wiki/Loop_nest_optimization
For many reasons.
First, Fortran compilers are highly optimized, and the language allows them to be as such. C and C++ are very loose in terms of array handling (e.g. the case of pointers referring to the same memory area). This means that the compiler cannot know in advance what to do, and is forced to create generic code. In Fortran, your cases are more streamlined, and the compiler has better control of what happens, allowing him to optimize more (e.g. using registers).
Another thing is that Fortran store stuff columnwise, while C stores data row-wise. I havent' checked your code, but be careful of how you perform the product. In C you must scan row wise: this way you scan your array along contiguous memory, reducing the cache misses. Cache miss is the first source of inefficiency.
Third, it depends of the blas implementation you are using. Some implementations might be written in assembler, and optimized for the specific processor you are using. The netlib version is written in fortran 77.
Also, you are doing a lot of operations, most of them repeated and redundant. All those multiplications to obtain the index are detrimental for the performance. I don't really know how this is done in BLAS, but there are a lot of tricks to prevent expensive operations.
For example, you could rework your code this way
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
if ( ADim2!=BDim1 ) throw std::runtime_error("Error sizes off");
memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
int cc2,cc1,cr1, a1,a2,a3;
for ( cc2=0 ; cc2<BDim2 ; ++cc2 ) {
a1 = cc2*ADim2;
a3 = cc2*BDim1
for ( cc1=0 ; cc1<ADim2 ; ++cc1 ) {
a2=cc1*ADim1;
ValT b = B[a3+cc1];
for ( cr1=0 ; cr1<ADim1 ; ++cr1 ) {
C[a1+cr1] += A[a2+cr1]*b;
}
}
}
}
Try it, I am sure you will save something.
On you #1 question, the reason is that matrix multiplication scales as O(n^3) if you use a trivial algorithm. There are algorithms that scale much better.