Speed Up Matrix Multiplication - C++

I have to compute the boolean product of a matrix with itself in a C++ program, and I want to optimize it.
The matrix is symmetric, so I thought a row-by-row multiplication would reduce cache misses.
I allocated space for the matrix in this way:
matrix = new bool*[dimension];
for (i = 0; i < dimension; i++) {
    matrix[i] = new bool[dimension];
}
And the multiplication is the following:
for (m = 0; m < dimension; m++) {
    for (n = 0; n < dimension; n++) {
        for (k = 0; k < dimension; k++) {
            temp = mat[m][k] && mat[n][k];
            B[m][n] = B[m][n] || temp;
            ...
I compared the computation time of this version against another version with a row-by-column multiplication like this:
for (m = 0; m < dimension; m++) {
    for (n = 0; n < dimension; n++) {
        for (k = 0; k < dimension; k++) {
            temp = mat[m][k] && mat[k][n];
            B[m][n] = B[m][n] || temp;
            ...
I ran tests on a 1000x1000 matrix. The results showed that the second version (row by column) is faster than the first one.
Could you explain why? Shouldn't the first algorithm have fewer cache misses?

In the first multiplication approach, the rows of the boolean matrices are stored consecutively in memory and also accessed consecutively, so prefetching works flawlessly. In the second approach, the cache line fetched when accessing element (n,0) may already be evicted by the time you access (n+1,0). Whether this actually happens depends on the architecture and the cache hierarchy of the machine you run your code on. On my machine the first approach is indeed faster for large enough matrices.
As for speeding up the computation: do not use logical operators, since they are evaluated lazily and can therefore cause branch mispredictions; the inner loop can be exited early as soon as B[m][n] becomes true. Instead of booleans, consider using the bits of, say, integers: that way you can combine 32 or 64 elements in your inner loop at once and possibly use vectorization. If your matrices are rather sparse, you might want to switch to sparse matrix data structures. Changing the order of the loops can also help, as can introducing blocking. However, any performance optimization is specific to an architecture and a class of input matrices.
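A minimal sketch of the bit-packing idea, assuming each row has been packed into 64-bit words beforehand and that dimension is a multiple of 64 (row_overlap is a hypothetical helper, not part of the original code):

#include <cstdint>
#include <vector>

// One uint64_t holds 64 boolean entries, so a single bitwise AND combines
// 64 element pairs at once, with no branch per element. Because the matrix
// is symmetric, mat[n][k] == mat[k][n], so comparing packed rows m and n
// is equivalent to both loop orders above.
bool row_overlap(const std::vector<std::uint64_t>& rowM,
                 const std::vector<std::uint64_t>& rowN) {
    for (std::size_t w = 0; w < rowM.size(); ++w)
        if (rowM[w] & rowN[w])
            return true;   // early exit: B[m][n] is already known to be true
    return false;
}

With the rows packed this way, B[m][n] = row_overlap(packed[m], packed[n]).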

Speeding suggestion. In the inner loop, exit as soon as the result is known:
Bmn = false;
for (k = 0; k < dimension; k++) {
    if (mat[m][k] && mat[k][n]) {
        Bmn = true;
        break;   // exit the k loop early
    }
}
B[m][n] = Bmn;

Related

How to obtain performance enhancement while multiplying two sub-matrices?

I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.
#pragma omp parallel for
for (size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for (size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for (size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}
The algorithm accesses the elements of both input sub-matrices row-wise to exploit spatial locality and improve cache usage.
What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?
You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.
For the matrix-matrix product, any permutation of the i, j, k indices computes the right result sequentially. In parallel, not so. In your original code the k iterations do not write to unique locations, so you cannot just collapse the outer two loops. Do a k,j interchange, and then it is allowed.
Of course, OpenMP only gets you from 5 percent efficiency on one core to 5 percent efficiency on all cores. You really want to block the loops, but that is a lot harder. See the paper by Goto and van de Geijn.
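A minimal sketch of that interchange, assuming the element type is double: with the loop order i, j, k, each (i, j) pair writes only to its own resultMatrix(i, j), so collapsing the outer two loops is safe:

#pragma omp parallel for collapse(2)
for (size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for (size_t j = 0; j < matrixA.m_edgeSize; j++) {
        double sum = 0.0;   // private accumulator: one write to the result per (i, j)
        for (size_t k = 0; k < matrixA.m_edgeSize; k++)
            sum += matrixA(i, k) * matrixB(k, j);
        resultMatrix(i, j) += sum;
    }
}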
I'm adding something related to the main matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices is re-used between different iterations and is likely to benefit from the CPU cache. For example, if a matrix is split into 4 sub-matrices, then each sub-matrix is used twice to compute a value of the result matrix.
To benefit most from the cache, the re-used data should be kept in the cache of the same thread (core). To achieve this, it may be better to move the work distribution to the level where you select the two sub-matrices.
So, something like this:
select sub-matrix A
#pragma omp parallel for
select sub-matrix B
for (size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for (size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for (size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}
could work faster, since the whole working set always stays within the same thread (core).

Time complexity of the Mandelbrot set in terms of big-O notation

I'm trying to find the time complexity of a simple implementation of the Mandelbrot set, with the following code:
#include <complex>
using std::complex;

// Assumed bounds for the array; the original snippet left them undefined.
const int max_rows = 22;
const int max_columns = 72;

int main() {
    int rows = 22, columns = 72, iterations = 28;
    char matrix[max_rows][max_columns];
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < columns; ++c) {
            complex<float> z;
            int itr = 0;
            while (abs(z) < 2 && ++itr < iterations)
                z = pow(z, 2.0f) + decltype(z)((float)c * 2 / columns - 1.5f,
                                               (float)r * 2 / rows - 1);
            matrix[r][c] = (itr == iterations ? '*' : '.');
        }
    }
}
Now, looking at the above code, I made some estimates of the time complexity in terms of big-O notation, and I want to know whether they are correct.
We are creating a 2D array, traversing it with nested loops, and at each element performing an operation and setting the element's value. If we take n as the input size, the greater the input, the greater the complexity, so the time complexity for rows x columns would be O(r*c). Then we traverse the array again to print it, so what would the total complexity be? Is it O(r*c) + O(r*c)? Does the computation itself have some effect on the time complexity when we do multiplication and subtraction on rows and columns? If yes, how?
Almost: given r rows, c columns and i iterations, the running time is O(r*c*i). This would be trivial to see if the abs(z) < 2 condition were not there. With this extra condition it is not clear how many times the inner while loop runs in total. It will certainly run fewer than r*c*i times, so O(r*c*i) is still an upper bound, but perhaps we can do better. Given that for any r, c you compute the Mandelbrot set over the same domain with varying resolution, the while loop runs roughly k*r*c*i times for some constant k between (area of the Mandelbrot set / area of the domain) and 1. Hence the running time of the code is Θ(r*c*i), and the bound O(r*c*i) cannot be improved.
Had you computed the set over a [-c,c] x [-r,r] domain with fixed resolution, then for any |z| > 2 the abs(z) < 2 condition would fail after the first iteration. Then O(r*c*i) would not be a tight bound, and this condition (like all loop conditions) would have to be considered for an accurate estimate.
Please don't use malloc; std::vector is safer.
In big-O notation, O(r*c) + O(r*c) collapses to O(r*c).
Since the maximal iteration count is also an input variable, it influences the complexity as well. In particular, the inner loop runs at most n iterations; therefore, your complexity is O(r*c*n).
All other operations take constant time, in particular the multiplication and addition of complex<float>. These operations are O(1) by themselves and do not contribute to the overall complexity.
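For the concrete values in the question (r = 22, c = 72, n = 28), that bound works out to at most 22 * 72 * 28 = 44,352 executions of the loop body.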

Efficient Tensor Multiplication

I have a matrix that is a representation of a higher-dimensional tensor, which could in principle be N-dimensional, but each dimension is the same size. Let's say I want to compute the following (summing over m and n):
D_ijkl = a_im * C_mnkl * b_jn
and C is stored as a matrix via
C_ijkl = C(I, J)
where there is some mapping from ij to I and kl to J.
I can do this with nested for loops where each dimension of my tensor is of size 3 via
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
        const int I = map_ij_to_I(i, j);
        for (int k = 0; k < 3; k++) {
            for (int l = 0; l < 3; l++) {
                const int J = map_kl_to_J(k, l);
                D(I, J) = 0.;
                for (int m = 0; m < 3; m++) {
                    for (int n = 0; n < 3; n++) {
                        const int M = map_mn_to_M(m, n);
                        D(I, J) += a(i, m) * C(M, J) * b(j, n);
                    }
                }
            }
        }
    }
}
but that's pretty messy and not very efficient. I'm using the Eigen matrix library so I suspect there is probably a much better way to do this than either a for loop or coding each entry separately. I've tried the unsupported tensor library and found it was slower than my explicit loops. Any thoughts?
As a bonus question, how would I compute something like the following efficiently?
There's a lot of work that your compiler's optimizer will do for you under the hood. For one, loops with a constant number of iterations are unrolled. That may be why your code is faster than the library.
I would suggest taking a look at the assembly produced with optimizations turned on, to really get a grasp of where you can optimize and what your program looks like once compiled.
Then, of course, you can think about parallel implementations, either on the CPU (multiple threads) or on the GPU (CUDA, OpenCL, OpenACC, etc.).
As for the bonus question: if you write it as two nested loops, I would suggest rearranging the expression so that the a_km term sits between the two sums. There is no need to perform that multiplication inside the inner sum, as it does not depend on n, although this will probably give only a slight performance benefit on modern CPUs.
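Since the loops above compute D(I, J) = sum over m, n of a(i, m) * b(j, n) * C(M, J), the whole contraction is a single matrix product with a Kronecker factor: D = (a kron b) * C. A minimal sketch in plain Eigen, assuming the row-major mappings I = 3*i + j, J = 3*k + l and M = 3*m + n (the kron helper is hand-written here, not an Eigen API):

#include <Eigen/Dense>

using Mat3 = Eigen::Matrix3d;
using Mat9 = Eigen::Matrix<double, 9, 9>;

// Kronecker product: K(3*p + q, 3*r + s) = a(p, r) * b(q, s).
Mat9 kron(const Mat3& a, const Mat3& b) {
    Mat9 K;
    for (int p = 0; p < 3; ++p)
        for (int r = 0; r < 3; ++r)
            K.block<3, 3>(3 * p, 3 * r) = a(p, r) * b;
    return K;
}

// D(I, J) = sum_M (a kron b)(I, M) * C(M, J): one 9x9 matrix product,
// which Eigen can vectorize.
Mat9 contract(const Mat3& a, const Mat3& b, const Mat9& C) {
    return kron(a, b) * C;
}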

What is the most time-efficient way to square each element of a vector of vectors in C++?

I currently have a vector of vectors of float containing some data:
vector<vector<float> > v1;
vector<vector<float> > v2;
I wanted to know the fastest way to square each element in v1 and store the result in v2. Currently I am just accessing each element of v1, multiplying it by itself, and storing it in v2, as seen below:
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
        v2[i][j] = v1[i][j] * v1[i][j];
    }
}
With a bit of luck, the compiler you are using understands what you want to do and converts it to use the CPU's SSE instructions, which do your squaring in parallel. In that case your code is close to optimal speed (on a single core). You could also try the Eigen library (http://eigen.tuxfamily.org/), which provides more reliable means of achieving high performance. You would then get something like
ArrayXXf v1 = ArrayXXf::Random(10, 10);
ArrayXXf v2 = v1.square();
which also makes your intention clearer.
If you want to stay in the CPU world, OpenMP should help you easily. A single #pragma omp parallel for will divide the load between the available cores, and you can get further gains by telling the compiler to vectorize with the ivdep and simd pragmas.
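A minimal sketch of that suggestion, assuming v2 has already been sized like v1 (square_all is a hypothetical helper name):

#include <cstddef>
#include <vector>

using std::vector;

// Outer loop split across cores; inner loop vectorized within each row.
void square_all(const vector<vector<float>>& v1, vector<vector<float>>& v2) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < v1.size(); ++i) {
        #pragma omp simd
        for (std::size_t j = 0; j < v1[i].size(); ++j)
            v2[i][j] = v1[i][j] * v1[i][j];
    }
}

Compile with OpenMP enabled (e.g. -fopenmp on GCC/Clang).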
If the GPU is an option, this is a matrix calculation that is perfect for OpenCL. Google for OpenCL matrix multiplication examples. Basically, you can have 2000 threads executing a single operation, or fewer threads operating on vector chunks, and the kernel is very simple to write.

OpenMP and memory bandwidth restriction

Edit: My first code sample was wrong. Fixed with a simpler one.
I am implementing a C++ library for algebraic operations between large vectors and matrices. I found that on x86-64 CPUs, OpenMP parallel vector additions, dot products, etc. are not much faster than single-threaded code: parallel operations are between -1% and 6% faster than single-threaded ones. This happens because of the memory bandwidth limitation (I think).
So, the question is: is there a real performance benefit for code like this:
void DenseMatrix::identity()
{
    assert(height == width);
    #pragma omp parallel for if (height > OPENMP_BREAK2)
    for (unsigned int y = 0; y < height; y++)
        for (unsigned int x = 0; x < width; x++)
            // Index computed from (y, x): each thread writes only its own
            // rows, with no shared running counter to race on.
            elements[y * width + x] = (x == y) ? 1 : 0;
}
In this sample there is no serious drawback from using OpenMP.
But when I work with OpenMP on sparse vectors and sparse matrices, I cannot use, for instance, *.push_back(), and in that case the question becomes serious. (The elements of sparse vectors are not contiguous like those of dense vectors, so parallel programming has a drawback here: result elements can arrive at any time, not in order from lower to higher index.)
I don't think this is a problem of memory bandwidth. I clearly see a problem with r: it is accessed from multiple threads, which causes both data races and false sharing. False sharing can dramatically hurt your performance.
I wonder whether you even get the correct answer, because there are data races on r. Did you get the correct answer?
However, the solution is very simple: the operation conducted on r is a reduction, which can easily be achieved with OpenMP's reduction clause.
http://msdn.microsoft.com/en-us/library/88b1k8y5(v=vs.80).aspx
Try simply appending reduction(+ : r) after #pragma omp parallel.
(Note: additions on double are not associative, so you may see small precision differences compared with the result of the serial code.)
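A minimal sketch of that suggestion, assuming r accumulates a dot product (the first code sample, which this answer refers to, was removed in the edit; dot is a hypothetical helper):

#include <cstddef>
#include <vector>

// Each thread keeps a private copy of r; OpenMP combines the partial sums
// at the end, removing both the data race and the false sharing on r.
double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    double r = 0.0;
    #pragma omp parallel for reduction(+ : r)
    for (std::size_t i = 0; i < a.size(); ++i)
        r += a[i] * b[i];
    return r;
}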