How to optimize matrix multiplication operation [duplicate] - c++

I need to perform a lot of matrix operations in my application. The most time-consuming one is matrix multiplication. I implemented it this way:
template<typename T>
Matrix<T> Matrix<T>::operator * (Matrix& matrix)
{
    Matrix<T> multipliedMatrix = Matrix<T>(this->rows,matrix.GetColumns(),0);
    for (int i=0;i<this->rows;i++)
    {
        for (int j=0;j<matrix.GetColumns();j++)
        {
            multipliedMatrix.datavector.at(i).at(j) = 0;
            for (int k=0;k<this->columns ;k++)
            {
                multipliedMatrix.datavector.at(i).at(j) += datavector.at(i).at(k) * matrix.datavector.at(k).at(j);
            }
            //cout<<(*multipliedMatrix)[i][j]<<endl;
        }
    }
    return multipliedMatrix;
}
Is there a better way to write it? So far, matrix multiplication takes most of the time in my application. Is there a good, fast library for this kind of thing?
However, I can't really use libraries that rely on the graphics card for mathematical operations, because I work on a laptop with an integrated graphics card.

Eigen is by far one of the fastest, if not the fastest, linear algebra libraries out there. It is well written and of high quality. It also uses expression templates, which make for more readable code. Version 3, just released, uses OpenMP for data parallelism.
#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;
int main()
{
    MatrixXd m(2,2);
    m(0,0) = 3;
    m(1,0) = 2.5;
    m(0,1) = -1;
    m(1,1) = m(1,0) + m(0,1);
    std::cout << m << std::endl;
}

Boost uBLAS I think is definitely the way to go with this sort of thing. Boost is well designed, well tested and used in a lot of applications.

Consider the GNU Scientific Library or MV++.
If you're okay with C, BLAS is a low-level library that incorporates both C and C-wrapped FORTRAN routines and is used by a huge number of higher-level math libraries.
I don't know anything about this, but another option might be Meschach, which seems to have decent performance.
Edit: With respect to your comment about not wanting to use libraries that use your graphics card, I'll point out that in many cases the libraries that use your graphics card are specialized implementations of standard (non-GPU) libraries. For example, various implementations of BLAS are listed on its Wikipedia page, and only some of them are designed to leverage your GPU.

There is a book called Introduction to Algorithms. You may like to check the chapter on Dynamic Programming. It has an excellent matrix-chain multiplication algorithm using dynamic programming, which picks the cheapest order in which to multiply a sequence of matrices. It's worth a read. This is in case you want to write your own logic instead of using a library.
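The dynamic-programming treatment in that book concerns matrix-chain multiplication: it does not speed up a single product, but it finds the cheapest parenthesization when several matrices are multiplied in a row. A minimal sketch of the cost computation follows; the function name `matrixChainCost` and the `dims` layout are illustrative choices, not taken from the book.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Minimum number of scalar multiplications needed to evaluate the chain
// A0 * A1 * ... * A(n-1), where matrix Ai has dims[i] rows and dims[i+1] columns.
// Hypothetical helper for illustration only.
std::uint64_t matrixChainCost(const std::vector<std::uint64_t>& dims)
{
    assert(dims.size() >= 2);
    const std::size_t n = dims.size() - 1;  // number of matrices in the chain

    // cost[i][j] = cheapest way to multiply matrices i..j (inclusive).
    std::vector<std::vector<std::uint64_t>> cost(
        n, std::vector<std::uint64_t>(n, 0));

    for (std::size_t len = 2; len <= n; ++len) {
        for (std::size_t i = 0; i + len <= n; ++i) {
            const std::size_t j = i + len - 1;
            cost[i][j] = std::numeric_limits<std::uint64_t>::max();
            // Try every split point (Ai..Ak) * (A(k+1)..Aj).
            for (std::size_t k = i; k < j; ++k) {
                const std::uint64_t c = cost[i][k] + cost[k + 1][j]
                                      + dims[i] * dims[k + 1] * dims[j + 1];
                if (c < cost[i][j]) cost[i][j] = c;
            }
        }
    }
    return cost[0][n - 1];
}
```

For three matrices of sizes 10x30, 30x5 and 5x60, `matrixChainCost({10, 30, 5, 60})` gives 4500 (multiply the first two first), whereas the other order would cost 27000 scalar multiplications.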

There are plenty of algorithms for efficient matrix multiplication.
Algorithms for efficient matrix multiplication
Look at the algorithms and find an implementation.
You can also make a multi-threaded implementation for it.
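Before switching algorithms or adding threads, the triple loop from the question can often be made faster just by storing the matrix in one contiguous buffer and reordering the loops so the innermost one walks memory sequentially. The sketch below assumes a hypothetical `FlatMatrix` type (row-major, `double` only) rather than the asker's `Matrix<T>`; how much it helps depends on matrix size and compiler flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal matrix type: row-major storage in one contiguous buffer.
struct FlatMatrix {
    std::size_t rows, cols;
    std::vector<double> data;

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// c = a * b using i-k-j loop order: the inner loop reads a row of b and
// writes a row of c sequentially, which is friendlier to the cache than
// striding down a column of b as the i-j-k order does.
FlatMatrix multiply(const FlatMatrix& a, const FlatMatrix& b)
{
    assert(a.cols == b.rows);
    FlatMatrix c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t k = 0; k < a.cols; ++k) {
            const double aik = a(i, k);  // reused across the whole inner loop
            for (std::size_t j = 0; j < b.cols; ++j) {
                c(i, j) += aik * b(k, j);
            }
        }
    }
    return c;
}
```

Each row of the result depends only on the matching row of `a`, so the outer loop is also the natural place to split work across threads.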

Related

Is there a `numpy.minimum` equivalent in GSL?

I'm working on porting a complex data analysis routine I "prototyped" in Python to C++. I used Numpy extensively throughout the Python code. I'm looking at employing the GSL in the C++ port since it implements all of the various numerical routines I require (whereas Armadillo, Eigen, etc. only have a subset of what I need, though their APIs are closer to what I am looking for).
Is there an equivalent to numpy.minimum in the GSL (i.e., element-wise minimum of two matrices)? This is just one example of the abstractions from Numpy that I am looking for. Do things like this simply have to be reimplemented manually when using the GSL? I note that the GSL provides for things like:
double gsl_matrix_min (const gsl_matrix * m)
But that simply provides the minimum value of the entire matrix. Regardless of element-wise comparisons, it doesn't even seem possible to report the minimum along a particular axis of a single matrix using the GSL. That surprises me.
Are my expectations misplaced?
You can implement an element-wise minimum easily in Armadillo, via the find() and .elem() functions:
mat A; A.randu(5,5);
mat B; B.randu(5,5);
umat indices = find(B < A);
mat C = A;
C.elem(indices) = B.elem(indices);
For other functions that are not present in Armadillo, it might be possible to interface Armadillo matrices with GSL functions, through the .memptr() function.
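If it does come down to reimplementing such helpers by hand, they are short. The sketch below works on plain row-major buffers and uses only the standard library; the names `elementwiseMin` and `columnMin` are made up for illustration. To apply the same idea to a `gsl_matrix`, the loops would have to honour its row stride (`tda`), which is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise minimum of two equally sized buffers (the numpy.minimum idea).
std::vector<double> elementwiseMin(const std::vector<double>& a,
                                   const std::vector<double>& b)
{
    assert(a.size() == b.size());
    std::vector<double> out(a.size());
    std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                   [](double x, double y) { return std::min(x, y); });
    return out;
}

// Minimum of each column of a rows x cols row-major matrix
// (a reduction along the row axis).
std::vector<double> columnMin(const std::vector<double>& m,
                              std::size_t rows, std::size_t cols)
{
    assert(rows > 0 && m.size() == rows * cols);
    // Start from the first row, then fold in the remaining rows.
    std::vector<double> out(m.begin(),
                            m.begin() + static_cast<std::ptrdiff_t>(cols));
    for (std::size_t i = 1; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            out[j] = std::min(out[j], m[i * cols + j]);
        }
    }
    return out;
}
```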

What is the best way to create a pentadiagonal sparse matrix for finite difference method in c/c++?

In MATLAB, it is very convenient to create a pentadiagonal sparse matrix using commands like this:
I = eye(m); % create identity matrix
e = ones(m,1); % create an array of all 1's
T = spdiags([e -4*e e],[-1 0 1],m,m);
S = spdiags([e e],[-1 1],m,m);
A = (kron(I,T) + kron(S,I))/h^2;
I was wondering if there is any neat trick to do the same in c/c++.
There is no sparse matrix type in the C++ standard library, but there are a lot of open-source linear algebra libraries around the web (or you can write your own).
Boost uBLAS supports sparse matrices, and it's probably the best choice if you just want to experiment with finite differences.
If you need more advanced solvers, you should take a look at GSL, or consider the C version of LAPACK.
As for your original question, as far as I know none of those libraries implements a kron function, since it is only a "convenience" routine.
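Without a kron routine, the same matrix can be assembled directly, because the structure of (kron(I,T) + kron(S,I))/h^2 is known in advance: -4 on the diagonal, 1 next to it inside each m x m block, and 1 at distance m between neighbouring blocks. The sketch below emits the nonzeros as (row, column, value) triplets; the `Triplet` struct and `laplacian2D` name are hypothetical, and the list would still need to be handed to whatever sparse matrix type the chosen library provides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One nonzero entry of a sparse matrix in coordinate (COO) form.
struct Triplet {
    std::size_t row;
    std::size_t col;
    double value;
};

// Nonzeros of the m*m x m*m pentadiagonal matrix
//   A = (kron(I, T) + kron(S, I)) / h^2
// with T = tridiag(1, -4, 1) and S = tridiag(1, 0, 1), both m x m.
std::vector<Triplet> laplacian2D(std::size_t m, double h)
{
    assert(m > 0 && h > 0.0);
    const double s = 1.0 / (h * h);
    std::vector<Triplet> entries;
    entries.reserve(5 * m * m);

    for (std::size_t block = 0; block < m; ++block) {
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t r = block * m + i;
            entries.push_back({r, r, -4.0 * s});                  // diagonal of T
            if (i > 0)         entries.push_back({r, r - 1, s});  // sub-diagonal of T
            if (i + 1 < m)     entries.push_back({r, r + 1, s});  // super-diagonal of T
            if (block > 0)     entries.push_back({r, r - m, s});  // kron(S, I), lower
            if (block + 1 < m) entries.push_back({r, r + m, s});  // kron(S, I), upper
        }
    }
    return entries;
}
```

The boundary checks on `i` keep the inner off-diagonals from leaking across block boundaries, which is exactly what kron(I,T) guarantees in the MATLAB version.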

Fortran-style multidimensional arrays in C++

Is there a C++ library which provides Fortran-style multidimensional arrays with support for slicing, passing as procedure parameters, and decent documentation? I've looked into blitz++, but it's dead!
I highly suggest Armadillo:
Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use
It is a C++ template library:
A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries; this is automatically accomplished through template meta-programming
A simple example from the web page:
#include <iostream>
#include <armadillo>
int main(int argc, char** argv)
{
    arma::mat A = arma::randu<arma::mat>(4,5);
    arma::mat B = arma::randu<arma::mat>(4,5);
    std::cout << A*B.t() << std::endl;
    return 0;
}
If you are running OS X then you can use the vDSP libs for free.
If you want to deploy on Windows targets then either license the Intel equivalents (MKL) or, I think, the AMD vector math libs (ACML) are free.

Worse performance using Eigen than using my own class

A couple of weeks ago I asked a question about the performance of matrix multiplication.
I was told that in order to enhance the performance of my program I should use some specialised matrix classes rather than my own class.
StackOverflow users recommended:
uBLAS
EIGEN
BLAS
At first I wanted to use uBLAS; however, from reading the documentation it seemed to me that this library doesn't support matrix-matrix multiplication.
In the end I decided to use the Eigen library. So I replaced my matrix class with Eigen::MatrixXd; however, it turned out that my application now runs even slower than before.
Before using Eigen the run time was 68 seconds; after replacing my matrix class with the Eigen matrix, the program runs for 87 seconds.
The parts of the program which take the most time look like this:
TemplateClusterBase* TemplateClusterBase::TransformTemplateOne( vector<Eigen::MatrixXd*>& pointVector, Eigen::MatrixXd& rotation ,Eigen::MatrixXd& scale,Eigen::MatrixXd& translation )
{
    for (int i=0;i<pointVector.size();i++ )
    {
        //Eigen::MatrixXd outcome =
        Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
        //delete prototypePointVector[i]; // ((rotation*scale)* (*prototypePointVector[i]) + translation).ConvertToPoint();
        MatrixHelper::SetX(*prototypePointVector[i],MatrixHelper::GetX(outcome));
        MatrixHelper::SetY(*prototypePointVector[i],MatrixHelper::GetY(outcome));
        //assosiatedPointIndexVector[i] = prototypePointVector[i]->associatedTemplateIndex = i;
    }
    return this;
}
and
Eigen::MatrixXd AlgorithmPointBased::UpdateTranslationMatrix( int clusterIndex )
{
    double membershipSum = 0,outcome = 0;
    double currentPower = 0;
    Eigen::MatrixXd outcomePoint = Eigen::MatrixXd(2,1);
    outcomePoint << 0,0;
    Eigen::MatrixXd templatePoint;
    for (int i=0;i< imageDataVector.size();i++)
    {
        currentPower =0;
        membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);
        outcomePoint.noalias() += (*imageDataVector[i] - (prototypeVector[clusterIndex]->rotationMatrix*prototypeVector[clusterIndex]->scalingMatrix* ( *templateCluster->templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]) ))*currentPower ;
    }
    outcomePoint.noalias() = outcomePoint/=membershipSum;
    return outcomePoint; //.ConvertToMatrix();
}
As you can see, these functions perform a lot of matrix operations. That is why I thought using Eigen would speed up my application. Unfortunately (as I mentioned above), the program runs slower.
Is there any way to speed up these functions?
Maybe I would get better performance if I used DirectX matrix operations? (However, I have a laptop with an integrated graphics card.)
If you're using Eigen's MatrixXd types, those are dynamically sized. You should get much better results from using the fixed-size types, e.g. Matrix4d, Vector4d.
Also, make sure you're compiling such that the code can get vectorized; see the relevant Eigen documentation.
Re your thought on using the Direct3D extensions library (D3DXMATRIX etc.): it's OK (if a bit old-fashioned) for graphics geometry (4x4 transforms etc.), but it's certainly not GPU accelerated (just good old SSE, I think). Also note that it is single-precision (float) only, while you seem to be set on using doubles. Personally I'd much prefer to use Eigen unless I was actually coding a Direct3D app.
Make sure to have compiler optimization switched on (e.g. at least -O2 on gcc). Eigen is heavily templated and will not perform very well if you don't turn on optimization.
Which version of Eigen are you using? They recently released 3.0.1, which is supposed to be faster than 2.x. Also, make sure you play a bit with the compiler options. For example, make sure SSE is being used in Visual Studio:
C/C++ --> Code Generation --> Enable Enhanced Instruction Set
You should profile and then optimize first the algorithm, then the implementation. In particular, the posted code is quite inefficient:
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
I don't know the library, so I won't even try to guess the number of unnecessary temporaries that you are creating, but a simple refactor:
Eigen::MatrixXd tmp = rotation*scale;
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = tmp*(*pointVector[i]) + translation;
can save you a good number of expensive multiplications (and again, probably new temporary matrices that get discarded right away).
A couple of points.
Why are you multiplying rotation*scale inside of the loop when that product will have the same value each iteration? That is a lot of wasted effort.
You are using dynamically sized matrices rather than fixed sized matrices. Someone else mentioned this already, and you said you shaved off 2 sec.
You are passing arguments as a vector of pointers to matrices. This adds an extra pointer indirection and destroys any guarantee of data locality, which will give poor cache performance.
I hope this isn't insulting, but are you compiling in Release or Debug? Eigen is very slow in debug builds, because it uses lots of trivial templated functions that are optimized out of release but remain in debug.
Looking at your code, I am hesitant to blame Eigen for performance problems. However, most linear algebra libraries (including Eigen) are not really designed for your use case of lots of tiny matrices. In general, Eigen will be better optimized for 100x100 or larger matrices. You very well may be better off using your own matrix class or the DirectX math helper classes. The DirectX math classes are completely independent from your video card.
Looking back at your previous post and the code in there, my suggestion would be to use your old code, but improve its efficiency by moving things around. I'm posting on that previous question to keep the answers separate.
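The suggestions above (hoist `rotation*scale` out of the loop, use fixed-size storage, keep the points contiguous instead of behind pointers) can be combined in a sketch like the following. It assumes the points are 2D, as the `Eigen::MatrixXd(2,1)` in the question suggests, and uses hypothetical `Vec2` and `Mat2` types instead of Eigen or the asker's own class; whether it beats either one would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size 2D types; no heap allocation per matrix.
struct Vec2 { double x, y; };
struct Mat2 { double a, b, c, d; };  // [[a, b], [c, d]]

inline Mat2 operator*(const Mat2& l, const Mat2& r)
{
    return {l.a * r.a + l.b * r.c, l.a * r.b + l.b * r.d,
            l.c * r.a + l.d * r.c, l.c * r.b + l.d * r.d};
}

inline Vec2 operator*(const Mat2& m, const Vec2& v)
{
    return {m.a * v.x + m.b * v.y, m.c * v.x + m.d * v.y};
}

// prototypePoints[i] = (rotation * scale) * templatePoints[i] + translation
void transformPoints(const std::vector<Vec2>& templatePoints,
                     std::vector<Vec2>& prototypePoints,
                     const Mat2& rotation, const Mat2& scale,
                     const Vec2& translation)
{
    assert(templatePoints.size() == prototypePoints.size());
    const Mat2 rs = rotation * scale;  // loop-invariant, computed once
    for (std::size_t i = 0; i < templatePoints.size(); ++i) {
        const Vec2 q = rs * templatePoints[i];
        prototypePoints[i] = {q.x + translation.x, q.y + translation.y};
    }
}
```

Storing the points by value in a `std::vector<Vec2>` keeps them adjacent in memory, so the loop touches one contiguous range rather than chasing a pointer per point.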

Visual C++ BigInt and SecureRandom? Is there a BigInt library with modPow?

I have to port some crypto code from Java to Visual C++, which I am not very familiar with. I found a library at http://sourceforge.net/projects/cpp-bigint/ that I can use for big integers.
However, it does not have an equivalent to Java's SecureRandom class. I did find a C++ project called beecrypt but could not get it to work with Visual Studio 2008.
Does anyone have any experience with these types of libraries? I saw GMP too but couldn't find a build that worked with Visual Studio out of the box.
Any advice before I head down the wrong road?
Thanks!
----UPDATE-------
I seem to have a proof of concept working with the cpp-bigint from above with small numbers. In the library there is no modPow function. For now I created a for loop like:
for(RossiBigInt i("0",DEC_DIGIT); i< r; i++)
{
    x = x * g;
    x = x % p;
}
This gives me x = g^r mod p, but it is very slow. Does anyone know of other BigInteger libraries with a modPow function, or a faster way for me to compute this?
Thanks!
The modPow function can be evaluated efficiently with a "square and multiply" algorithm. In Java it would look like this (if Java's BigInteger did not already have it):
/* Compute x^n mod m. */
static BigInteger modPow(BigInteger x, BigInteger n, BigInteger m)
{
    if (n.signum() < 0)
        throw new IllegalArgumentException("bwah, negative exponent");
    BigInteger r = BigInteger.ONE;
    for (int i = n.bitLength() - 1; i >= 0; i --) {
        if (n.testBit(i))
            r = r.multiply(x).mod(m);
        if (i > 0)
            r = r.multiply(r).mod(m);
    }
    return r;
}
With this, the number of loop iterations is equal to the length, in bits, of the exponent, so the computation time is acceptable.
You still get one or two modular reductions per iteration, so this will not be the fastest exponentiation algorithm ever (modular reductions are substantially more expensive than multiplications). Typical modPow() implementations use Montgomery reduction, a clever trick that replaces the division-based reductions inside the loop with cheaper multiply-and-shift steps, at the cost of converting into and out of a special representation at the start and end.
If you have time, implementing your own modular exponentiation would be very pedagogical; you would start by reading chapter 14 of the "Handbook of Applied Cryptography", freely downloadable from this site. However, in this harsh world where mundane considerations of budget often limit creativity and free time, you would probably be happy with an already implemented library. GMP is known to be quite good, but somewhat difficult to use on Windows. You may have better luck with NTL.
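For reference, the same square-and-multiply idea can be sketched in C++ on fixed-width integers. This version scans the exponent from the least significant bit (the right-to-left variant, not the left-to-right loop of the Java code), and it assumes the modulus fits in 32 bits so that every intermediate product fits in 64 bits. With a big-integer class, the structure would stay the same, with the `*`, `%` and shift replaced by that library's operations.

```cpp
#include <cassert>
#include <cstdint>

// Computes (base^exponent) mod modulus by square-and-multiply.
// Assumption for this sketch: 0 < modulus < 2^32, so products of two
// reduced values never overflow 64 bits.
std::uint64_t modPow(std::uint64_t base, std::uint64_t exponent,
                     std::uint64_t modulus)
{
    assert(modulus != 0 && modulus <= 0xFFFFFFFFull);
    std::uint64_t result = 1 % modulus;  // handles modulus == 1
    base %= modulus;
    while (exponent > 0) {
        if (exponent & 1u) {
            result = (result * base) % modulus;  // multiply step
        }
        base = (base * base) % modulus;          // square step
        exponent >>= 1;
    }
    return result;
}
```

The loop runs once per bit of the exponent, compared with once per unit of `r` in the loop from the question.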
For generating random data on Windows you can also use CryptoAPI, specifically the CryptGenRandom method.