Matrix Multiplication optimization via matrix transpose - c++

I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get 8x improvement. However, I am only getting 2x ... what might I be doing wrong?
Full Source on GitHub
void transpose(int size, matrix m) {
int i, j;
for (i = 0; i < size; i++)
for (j = 0; j < size; j++)
std::swap(m.element[i][j], m.element[j][i]);
}
void mm(matrix a, matrix b, matrix result) {
int i, j, k;
int size = a.size;
long long before, after;
before = wall_clock_time();
// Do the multiplication
transpose(size, b); // transpose the matrix to reduce cache miss
for (i = 0; i < size; i++)
for (j = 0; j < size; j++) {
int tmp = 0; // save memory writes
for(k = 0; k < size; k++)
tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
}
after = wall_clock_time();
fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before))/1000000000);
}
Am I doing things right so far?
FYI: The next optimization I need to do is use SIMD/Intel SSE3

Am I doing things right so far?
No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations it always a good idea to use the naive but suboptimal implementation as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.
Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.
Yet another optimization that will help the speedup is to cache some pointers. This is still quite slow:
for(k = 0; k < size; k++)
tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
An optimizer might see a way around the pointer problems, but probably not. At least not if you don't use the nonstandard __restrict__ keyword to tell the compiler that your matrices don't overlap. Cache pointers so you don't have to do a.element[i], b.element[j], and result.element[i]. And it still might help to tell the compiler that these arrays don't overlap with the __restrict__ keyword.
Addendum
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using struct rather than class, malloc rather than new, typedef struct rather than just struct, C headers rather than C++ headers.
Because of your implementation of your struct matrix, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls m(a, a, a_squared) to get the square of matrix a. You will get burned even worse if some expects m(a, a, a) to do an in-place computation of a2.
Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.
Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.

If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it does the transpose TWO times, resulting in no transpose at all. The j=loop should not read
j=0; j<size; j++
but
j=0; j<i; j++
Transposing is not necessary to avoid processing the elements of one of the factor-matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance-tuning, the basic loop-structure should be:
for (int i=0; i<size; i++)
{
for (int k=0; k<size; k++)
{
double tmp = a[i][k];
for (int j=0; j<size; j++)
{
result[i][j] += tmp * b[k][j];
}
}
}

Related

Why is bool [][] more efficient than vector<vector<bool>>

I am trying to solve a DP problem where I create a 2D array, and fill the 2D array all the way. My function is called multiple times with different test cases. When I use a vector>, I get a time limit exceeded error (takes more than 2 sec for all the test cases). However, when I use bool [][], takes much less time (about 0.33 sec), and I get a pass.
Can someone please help me understand why would vector> be any less efficient than bool [][].
bool findSubsetSum(const vector<uint32_t> &input)
{
uint32_t sum = 0;
for (uint32_t i : input)
sum += i;
if ((sum % 2) == 1)
return false;
sum /= 2;
#if 1
bool subsum[input.size()+1][sum+1];
uint32_t m = input.size()+1;
uint32_t n = sum+1;
for (uint32_t i = 0; i < m; ++i)
subsum[i][0] = true;
for (uint32_t j = 1; j < n; ++j)
subsum[0][j] = false;
for (uint32_t i = 1; i < m; ++i) {
for (uint32_t j = 1; j < n; ++j) {
if (j < input[i-1])
subsum[i][j] = subsum[i-1][j];
else
subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
}
}
return subsum[m-1][n-1];
#else
vector<vector<bool>> subsum(input.size()+1, vector<bool>(sum+1));
for (uint32_t i = 0; i < subsum.size(); ++i)
subsum[i][0] = true;
for (uint32_t j = 1; j < subsum[0].size(); ++j)
subsum[0][j] = false;
for (uint32_t i = 1; i < subsum.size(); ++i) {
for (uint32_t j = 1; j < subsum[0].size(); ++j) {
if (j < input[i-1])
subsum[i][j] = subsum[i-1][j];
else
subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
}
}
return subsum.back().back();
#endif
}
Thank you,
Ahmed.
If you need a matrix and you need to do high performance stuff, it is not always the best solution to use a nested std::vector or std::array because these are not contiguous in memory. Non contiguous memory access results in higher cache misses.
See more :
std::vector and contiguous memory of multidimensional arrays
Is the data in nested std::arrays guaranteed to be contiguous?
On the other hand bool twoDAr[M][N] is guarenteed to be contiguous. It ensures less cache misses.
See more :
C / C++ MultiDimensional Array Internals
And to know about cache friendly codes:
What is “cache-friendly” code?
Can someone please help me understand why would vector> be any less
efficient than bool [][].
A two-dimensional bool array is really just a big one-dimensional bool array of size M * N, with no gaps between the items.
A two-dimensional std::vector doesn't exist; is not just one big one-dimensional std::vector but a std::vector of std::vectors. The outer vector itself has no memory gaps, but there may well be gaps between the content areas of the individual vectors. It depends on how your compiler implements the very special std::vector<bool> class, but if your element count is sufficiently big, then dynamic allocation is unavoidable to prevent a stack overflow, and that alone implies pointers to separated memory areas.
And once you need to access data from separated memory areas, things become slower.
Here is a possible solution:
Try to use a std::vector<bool> of size (input.size() + 1) * (sum + 1).
If that fails to make things faster, avoid the template specialisation by using a std::vector<char> of size (input.size() + 1) * (sum + 1), and cast the elements to and from bool as required.
In cases where you know the input size of the array from the beginning, using arrays will always be faster than Vector or equal. Because vector is a wrapper around array, therefore a higher-level implementation. It's benefit is allocating extra space for you when needed, which you don't need if you have a fixed size of elements.
If you had problem where you needed 1D arrays, the difference may not have bothered you(where you have a single vector, and a single array). But creating a 2D array, you also create many instances of the vector class, so the time difference between array and vector is multiplied by the number of elements you have in your container, making your code slow.
This time difference has many causes behind it, but the most obvious one is of course calling the vector constructor. You are calling a function subsum.size() times. The memory issue mentioned by other answers is another cause.
For performance, it is advised to use array's whenever you can in your code. Even if you need to use a vector, you should try to minimize the number of re-size's done by the vector(reserving, pre-allocating), achieving a closer implementation to array.

filling only half of Matrix using OpenMp in C++

I have a quite big matrix. I would like to fill half of the matrix in parallel.
m_matrix is 2D std vector. Any suggestion for the type of container is appreciated as well. What _fill(i,j) function is doing is not considered heavy compared to size of the matrix.
//i: row
//j: column
for (size_t i=1; i<num_row; ++i)
{
for (size_t j=0; j<i; ++j)
{
m_matrix[i][j] = _fill(i, j);
}
}
What would be a nice openMP structure for that? I tried dynamic strategy bet I got even time increase compared to the sequential mode.

Fastest way to calculate the abs()-values of a complex array

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.
Given that all loop iterations are independent, you can use the following code for parallelization:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
Of course, for using this you should enable OpenMP support while compiling your code (usually by using /openmp flag or setting the project options).
You can find several examples of OpenMP usage in wiki.
Or use Concurrency::parallele_for like that :
Concurrency::parallel_for(0, N, [&a, &b](int i)
{
b[i] = cabs(a[i]);
});
Use vector operations.
If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.
Libmvec is vector math library added in Glibc 2.22.
Vector math library was added to support SIMD constructs of OpenMP4.0
(#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding
vector implementations of vector math functions.
Vector math functions are vector variants of corresponding scalar math
operations implemented using SIMD ISA extensions (e.g. SSE or AVX for
x86_64). They take packed vector arguments, perform the operation on
each element of the packed vector argument, and return a packed vector
result. Using vector math functions is faster than repeatedly calling
the scalar math routines.
Also, see Parallel for vs omp simd: when to use each?
If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers to obtain the absolute value of each:
Description
These functions evaluate the function hypot(x, y) for an entire vector
of values at once. ...
The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c I don't recall if Sun Microsystems ever provided a Linux version of libmvec.so or not.
Using #pragma simd (even with -Ofast) or relying on the compilers auto-vectorization are more example of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. In order to use SIMD efficiently for this you need to use an array of struct of arrays. For example for single float with a SIMD width of 4 you could use
//struct of arrays of four complex numbers
struct c4 {
float x[4]; // real values of four complex numbers
float y[4]; // imaginary values of four complex numbers
};
Here is code showing how you could do this with SSE for the x86 instruction set.
#include <stdio.h>
#include <x86intrin.h>
#define N 10
struct c4{
float x[4];
float y[4];
};
static inline void cabs_soa4(struct c4 *a, float *b) {
__m128 x4 = _mm_loadu_ps(a->x);
__m128 y4 = _mm_loadu_ps(a->y);
__m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
_mm_storeu_ps(b, b4);
}
int main(void)
{
int n4 = ((N+3)&-4)/4; //choose next multiple of 4 and divide by 4
printf("%d\n", n4);
struct c4 a[n4]; //array of struct of arrays
for(int i=0; i<n4; i++) {
for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
}
float b[4*n4];
for(int i=0; i<n4; i++) {
cabs_soa4(&a[i], &b[4*i]);
}
for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}
It may help to unroll the loop a few times. In any case all this is moot for large N because the operation is memory bandwidth bound. For large N (meaning when the memory usage is much larger than the last level cache), although #pragma omp parallel may help some, the best solution is not to do this for large N. Instead do this in chunks which fit in the lowest level cache along with other compute operations. I mean something like this
for(int i = 0; i < nchunks; i++) {
for(int j = 0; j < chunk_size; j++) {
b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
}
foo(&b[i*chunck_size]); // foo is computationally intensive.
}
I did not implement an array of struct of array here but it should be easy to adjust the code for that.
If you are using a modern compiler (GCC 5, for example), you can use Cilk+, that will give you a nice array notation, automatically usage of SIMD instructions, and parallelisation.
So, if you want to run them in parallel you would do:
#include <cilk/cilk.h>
cilk_for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
or if you want to test SIMD:
#pragma simd
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But, the nicest part of Cilk is that you can just do:
b[:] = cabs(a[:])
In this case, the compiler and the runtime environment will decide to which level it should be SIMDed and what should be paralellised (the optimal way is applying SIMD on large-ish chunks in parallel).
Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing a near optimal scheduling, and that it should be able to make an optimal use of the cache.
Also, you can use std::future and std::async (they are part of C++11), maybe it's more clear way of achieving what you want to do:
#include <future>
...
int main()
{
...
// Create async calculations
std::future<void> *futures = new std::future<void>[N];
for (int i = 0; i < N; ++i)
{
futures[i] = std::async([&a, &b, i]
{
b[i] = std::sqrt(a[i]);
});
}
// Wait for calculation of all async procedures
for (int i = 0; i < N; ++i)
{
futures[i].get();
}
...
return 0;
}
IdeOne live code
We first create asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what is cabs. I'm sure it doesn't matter.
Also, maybe you'll find this link useful: cplusplus.com

How to parallelize a loop?

I'm using OpenMP on C++ and I want to parallelize very simple loop. But I can't do it correctly. All time I get wrong result.
for(i=2;i<N;i++)
for(j=2;j<N;j++)
A[i,j] =A[i-2,j] +A[i,j-2];
Code:
int const N = 10;
int arr[N][N];
#pragma omp parallel for
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
arr[i][j] = 1;
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
printf_s("%d ",arr[i][j]);
printf("\n");
}
Do you have any suggestions how I can do it? Thank you!
serial and parallel run will give different. result because in
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
.....
you update arr[i]. so you change data used by the other thread. it will lead to a read over write data race!
This
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
is always going to be a source of grief and unpredictable output. The OpenMP run time is going to hand each thread a range of values for i and leave them to it. There will be no determinism in the relative order in which threads update arr. For example, while thread 1 is updating elements with i = 2,3,4,5,...,100 (or whatever) and thread 2 is updating elements with i = 102,103,104,...,200 the program does not determine whether thread 1 updates arr[i,:] = 100 before or after thread 2 wants to use the updated values in arr. You have written a code with a classic data race.
You have a number of options to fix this:
You could tie yourself in knots trying to ensure that the threads update arr in the right (ie sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.
You could make 2 copies of arr and always update from one to the other, then from the other to the one. Something like (very pseudo-code)
for ...
{
old = 0
new = 1
arr[i][j][new] = arr[i-2][j][old] +arr[i][j-2][old];
old = 1
new = 0
}
Of course, this second approach trades space for time but that's often a reasonable trade-off.
You may find that adding an extra plane to arr doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old] the first index element rather than the last.
Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.
If any other options occur to me I'll edit them in.
To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.
Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).
A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:
if( too small to be worth parallelizing ) {
do serially
} else {
// Recursively:
Do upper left quadrant
Do lower left and upper right quadrants in parallel
Do lower right quadrant
}
As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.
If you ask about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively poor vector parallelizm (something like 2x speedup or so) without
changing the data structure and codebase. This is possible using OpenMP4.0 or CilkPlus pragma simd or similar (with safelen/vectorlength(2))
Well, you really have data dependence (both inner and outer loops), but it belongs to «WAR»[ (write after read) dependencies sub-category, which is blocker for using «omp parallel for» «as is» but not necessarily a problem for «pragma omp simd» loops.
To make this working you will need x86 compilers supporting pragma simd either via OpenMP4 or via CilkPlus (very recent gcc or Intel compiler).

triangular matrix conversion and auto parallelization

I'm playing a bit with auto parallelization in ICC (11.1; old, but can't do anything about it) and I'm wondering why the compiler can't parallelize the inner loop for a simple gaussian elimination:
void makeTriangular(float **matrix, float *vector, int n) {
for (int pivot = 0; pivot < n - 1; pivot++) {
// swap row so that the row with the largest value is
// at pivot position for numerical stability
int swapPos = findPivot(matrix, pivot, n);
std::swap(matrix[pivot], matrix[swapPos]);
std::swap(vector[pivot], vector[swapPos]);
float pivotVal = matrix[pivot][pivot];
for (int row = pivot + 1; row < n; row++) { // line 72; should be parallelized
float tmp = matrix[row][pivot] / pivotVal;
for (int col = pivot + 1; col < n; col++) { // line 74
matrix[row][col] -= matrix[pivot][col] * tmp;
}
vector[row] -= vector[pivot] * tmp;
}
}
}
We're only writing to the arrays dependent on the private row (and col) variable and row is guaranteed to be larger than pivot, so it should be obvious to the compiler that we aren't overwriting anything.
I'm compiling with -O3 -fno-alias -parallel -par-report3 and get lots of dependencies ala: assumed FLOW dependence between matrix line 75 and matrix line 73. or assumed ANTI dependence between matrix line 73 and matrix line 75. and the same for line 75 alone. What problem does the compiler have? Obviously I could tell it exactly what to do with some pragmas, but I want to understand what the compiler can get alone.
Basically the compiler can't figure out that there's no dependency due to the name matrix and the name vector being both read from and written too (even though with different regions). You might be able to get around this in the following fashion (though slightly dirty):
void makeTriangular(float **matrix, float *vector, int n)
{
for (int pivot = 0; pivot < n - 1; pivot++)
{
// swap row so that the row with the largest value is
// at pivot position for numerical stability
int swapPos = findPivot(matrix, pivot, n);
std::swap(matrix[pivot], matrix[swapPos]);
std::swap(vector[pivot], vector[swapPos]);
float pivotVal = matrix[pivot][pivot];
float **matrixForWriting = matrix; // COPY THE POINTER
float *vectorForWriting = vector; // COPY THE POINTER
// (then parallelize this next for loop as you were)
for (int row = pivot + 1; row < n; row++) {
float tmp = matrix[row][pivot] / pivotVal;
for (int col = pivot + 1; col < n; col++) {
// WRITE TO THE matrixForWriting VERSION
matrixForWriting[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
}
// WRITE TO THE vectorForWriting VERSION
vectorForWriting[row] = vector[row] - vector[pivot] * tmp;
}
}
}
Bottom line is just give the ones you're writing to a temporarily different name to trick the compiler. I know that it's a little dirty and I wouldn't recommend this kind of programming in general. But if you're sure that you have no data dependency, it's perfectly fine.
In fact, I'd put some comments around it that are very clear to future people who see this code that this was a workaround, and why you did it.
Edit: I think the answer was basically touched on by #FPK and an answer was posted by #Evgeny Kluev. However, in #Evgeny Kluev's answer he suggests making this an input parameter and that might parallelize but won't give the correct value since the entries in matrix won't be updated. I think the code I posted above will give the correct answer too.
The same auto-parallelization problem is on icc 12.1. So I used this newer version for experiments.
Adding an output matrix to your function's parameter list and changing body of the third loop to this
out[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
fixed the "FLOW dependence" problem. Which means, "-fno-alias" affects only function parameters, while contents of the single parameter remain under suspicion of being aliased. I don't know why this option does not affect everything. Since different parts of your matrix do not really alias each other, you can just leave this additional parameter to the function and pass the same matrix through this parameter.
Interestingly, while complaining about 'matrix', compiler say nothing about 'vector', which really has aliasing problems: this line vector[row] -= vector[pivot] * tmp; may lead to false aliasing (writing to vector[row] in one thread may touch the cache line, storing vector[pivot], used by every thread).
"FLOW dependence" is not the only problem in this code. After it was fixed, compiler still refuses to parallelize second and third loops because of "insufficient computational work". So I tried to give it some extra work:
float tmp = matrix[row][pivot] * pivotVal;
...
out[row][col] = matrix[row][col] - matrix[pivot][col] *tmp /pivotVal /pivotVal;
And after all this, the second loop was at last parallelized, though I'm not sure if it gained any speed improvement.
Update: I found a better alternative to giving computer "some extra work". Option -par-threshold50 does the trick.
I have no access to an icc to test my idea but I suspect the compiler fears aliasing: matrix is defined as float**: an array of pointers pointing to arrays of floats. All those pointers could point to the same float array so parallizing this would be very dangerous. This would make no sense, but the compiler cannot know.