How to obtain performance enhancement while multiplying two sub-matrices? - c++

I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.
#pragma omp parallel for
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
}
}
}
The algorithm accesses the elements of both input sub-matrices row-wise to enhance cache usage with the spatial locality.
What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?
You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.

For the matrix-matrix product, any permutation of i,j,k indices computes the right result, sequentially. In parallel, not so. In your original code the k iterations do not write to unique locations, so you can not just collapse the outer two loops. Do a k,j interchange and then it is allowed.
Of course OpenMP gets you from 5 percent efficiency on one core to 5 percent on all cores. You really want to block the loops. But that is a lot harder. See the paper by Goto and van de Geijn.

I'm adding something related to main matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices are re-used between different iterations and likely to benefit from CPU cache. For example, if there are 4 sub-matrices of a matrix, then each sub-matrix is used twice, to get a value on result matrix.
To benefit from cache most, the re-used data should be kept in the cache of the same thread (core). To do this, maybe it is better to move the work-distribution level to the place where you select two submatrices.
So, something like this:
select sub-matrix A
#pragma omp parallel for
select sub-matrix B
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
}
}
}
could work faster since whole data always stays in same thread (core).

Related

What is the most efficient way to repeat elements in a vector and apply a set of different functions across all elements using Eigen?

Say I have a vector containing only positive, real elements defined like this:
Eigen::VectorXd v(1.3876, 8.6983, 5.438, 3.9865, 4.5673);
I want to generate a new vector v2 that has repeated the elements in v some k times. Then I want to apply k different functions to each of the repeated elements in the vector.
For example, if v2 was v repeated 2 times and I applied floor() and ceil() as my two functions, the result based on the above vector would be a column vector with values: [1; 2; 8; 9; 5; 6; 3; 4; 4; 5]. Preserving the order of the original values is important here as well. These values are also a simplified example, in practice, I'm generating vectors v with ~100,000 or more elements and would like to make my code as vectorizable as possible.
Since I'm coming to Eigen and C++ from Matlab, the simplest approach I first took was to just convert this Nx1 vector into an Nx2 matrix, apply floor to the first column and ceil to the second column, take the transpose to get a 2xN matrix and then exploit the column-major nature of the matrix and reshape the 2xN matrix into a 2Nx1 vector, yielding the result I want. However, for large vectors, this would be very slow and inefficient.
This response by ggael effectively addresses how I could repeat the elements in the input vector by generating a sequence of indices and indexing the input vector. I could just then generate more sequences of indices to apply my functions to the relevant elements v2 and copy the result back to their respective places. However, is this really the most efficient approach? I dont fully grasp copy-on-write and move semantics, but I think the second indexing expressions would be in a sense redundant?
If that is true, then my guess is that a solution here would be some sort of nullary or unary expression where I could define an expression that accepts the vector, some index k and k expressions/functions to apply to each element and spits out the vector I'm looking for. I've read the Eigen documentation on the subject, but I'm struggling to build a functional example. Any help would be appreciated!
So, if I understand you correctly, you don't want to replicate (in terms of Eigen methods) the vector, you want to apply different methods to the same elements and store the result for each, correct?
In this case, computing it sequentially once per function is the easiest route. Most CPUs can only do one (vector) memory store per clock cycle, anyway. So for simple unary or binary operations, your gains have an upper bound.
Still, you are correct that one load is technically always better than two and it is a limitation of Eigen that there is no good way of achieving this.
Know that even if you manually write a loop that would generate multiple outputs, you should limit yourself in the number of outputs. CPUs have a limited number of line-fill buffers. IIRC Intel recommended using less than 10 "output streams" in tight loops, otherwise you could stall the CPU on those.
Another aspect is that C++'s weak aliasing restrictions make it hard for compilers to vectorize code with multiple outputs. So it might even be detrimental.
How I would structure this code
Remember that Eigen is column-major, just like Matlab. Therefore use one column per output function. Or just use separate vectors to begin with.
Eigen::VectorXd v = ...;
Eigen::MatrixX2d out(v.size(), 2);
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
Following the KISS principle, this is good enough. You will not gain much if anything by doing something more complicated. A bit of multithreading might gain you something (less than factor 2 I would guess) because a single CPU thread is not enough to max out memory bandwidth but that's about it.
Some benchmarking
This is my baseline:
int main()
{
int rows = 100013, repetitions = 100000;
Eigen::VectorXd v = Eigen::VectorXd::Random(rows);
Eigen::MatrixX2d out(rows, 2);
for(int i = 0; i < repetitions; ++i) {
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
}
}
Compiled with gcc-11, -O3 -mavx2 -fno-math-errno I get ca. 5.7 seconds.
Inspecting the assembler code finds good vectorization.
Plain old C++ version:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
for(std::ptrdiff_t j = 0; j < rows; ++j) {
const double vj = inarr[j];
outfloor[j] = std::floor(vj);
outceil[j] = std::ceil(vj);
}
40 seconds instead of 5! This version actually does not vectorize because the compiler cannot prove that the arrays don't alias each other.
Next, let's use fixed size Eigen vectors to get the compiler to generate vectorized code:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
std::ptrdiff_t j;
for(j = 0; j + 4 <= rows; j += 4) {
const Eigen::Vector4d vj = Eigen::Vector4d::Map(inarr + j);
const auto floorval = vj.array().floor();
const auto ceilval = vj.array().ceil();
Eigen::Vector4d::Map(outfloor + j) = floorval;
Eigen::Vector4d::Map(outceil + j) = ceilval;;
}
if(j + 2 <= rows) {
const Eigen::Vector2d vj = Eigen::Vector2d::MapAligned(inarr + j);
const auto floorval = vj.array().floor();
const auto ceilval = vj.array().ceil();
Eigen::Vector2d::Map(outfloor + j) = floorval;
Eigen::Vector2d::Map(outceil + j) = ceilval;;
j += 2;
}
if(j < rows) {
const double vj = inarr[j];
outfloor[j] = std::floor(vj);
outceil[j] = std::ceil(vj);
}
7.5 seconds. The assembler looks fine, fully vectorized. I'm not sure why performance is lower. Maybe cache line aliasing?
Last attempt: We don't try to avoid re-reading the vector but we re-read it blockwise so that it will be in cache by the time we read it a second time.
const int blocksize = 64 * 1024 / sizeof(double);
std::ptrdiff_t j;
for(j = 0; j + blocksize <= rows; j += blocksize) {
const auto& vj = v.segment(j, blocksize);
auto outj = out.middleRows(j, blocksize);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
}
const auto& vj = v.tail(rows - j);
auto outj = out.bottomRows(rows - j);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
5.4 seconds. So there is some gain here but not nearly enough to justify the added complexity.

Efficient Tensor Multiplication

I have a Matrix that is a representation of a higher dimensional tensor which could in principle be N dimensional but each dimension is the same size. Lets say I want to compute the following:
and C is stored as a matrix via
where there is some mapping from ij to I and kl to J.
I can do this with nested for loops where each dimension of my tensor is of size 3 via
for (int i=0; i<3; i++){
for (int j=0; j<3; j++){
I = map_ij_to_I(i,j);
for (int k=0; k<3; k++){
for (int l=0; l<3; l++){
J = map_kl_to_J(k,l);
D(I,J) = 0.;
for (int m=0; m<3; m++){
for (int n=0; n<3; n++){
M = map_mn_to_M(m,n);
D(I,J) += a(i,m)*C(M,J)*b(j,n);
}
}
}
}
}
}
but that's pretty messy and not very efficient. I'm using the Eigen matrix library so I suspect there is probably a much better way to do this than either a for loop or coding each entry separately. I've tried the unsupported tensor library and found it was slower than my explicit loops. Any thoughts?
As a bonus question, how would I compute something like the following efficiently?
There's a lot of work that the optimizer of your compiler will do for you under the hood. For once, loops with constant number of iterations are unrolled. That may be the reason why your code is faster than the library.
I would suggest to take a look at the assembly produced with the optimizations turned to really get a grasp on where you can optimize and how really your program looks like once compiled.
Then of course, you can think about parallel implementations either on the CPU (multiple threads) or on GPU (cuda, OpenCL, OpenAcc, etc).
As for the bonus question, if you think about writing it as two nested loops, I would suggest to rearrange the expression so that the a_km term is between the two sums. No need to perform that multiplication inside the inner sum as it doesn't depend on n. Although this will probably give only a slight performance benefit in modern CPUs...

Is this a good practice for vectorization?

I'm trying to improve performance from this code by vectorizing this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
return (float)d;
}
From my knowledge, you can vectorize loops that involves exactly one math operation. In the code above we have 5 math operations, so (using OMP):
#pragma omp simd
for( int k = 0; k < n; k++ )
d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
Isn't gonna work. However, I was thinking if breaking the loop above into multiple loops with exactly one math operation is a good practice for vectorization? The resulting code would be:
double p0[n], p3[n], p1[n], p2[n];
#pragma omp simd
for( int k = 0; k < n; k++ )
p0[k] = origin[f[k].p0]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
p3[k] = origin[f[k].p3]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
p1[k] = origin[f[k].p1]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
p2[k] = origin[f[k].p2]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
d += p0[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
d -= p1[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
d -= p2[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
d += p3[k];
Is this a good solution, or there is any better? Modern compilers (say gcc) are going to do this (or better) kind of optimizations (e.g. enabling -O3) by themselves (so there is actually no gain in performance)?
This is generally bad HPC programming practice by several reasons:
These days you normally have to make your code as computationally dense as possible. To achieve that you need to have the higher Arithmetic Intensity (AI) for the loop whenever you can. For simplicity think of AI as ration of [amount of computations] divided by [number of bytes moved to/from memory in order to perform these computations].
By splitting loops you make AI for each loop much lower in your case, because you do not reuse the same bytes for different computations anymore.
(Vector- or Thread-) -Parallel Reduction (in your case by variable "d") has cost/overhead which you don't want to multiply by 8 or 10 (i.e. by number of loops produced by you by hands).
In many cases Intel/CGG compiler vectorization engines can make slightly better optimization when have several data fields of the same data object processed in the same loop body as opposed to splitted loops case.
There are few theoretical advantages of loop splitting as well, but they don't apply to your case, so I provided them just in case. Loop splitting is reasonable/profitable when:
There is more than one loop carried dependency or reduction in the
same loop.
In case loop split helps to get rid of some out of order execution
data flow dependencies.
In case you do not have enough vector registers to perform all the computations (for complicated algorithms).
Intel Advisor (mentioned by you in prev questons) helps to analyze many of these factors and measures AI.
This is also true that good compilers "don't care" whenever you have one such loop or loop-split, because they could easily transform one case to another or vice versa on the fly.
However the applicability for this kind of transformation is very limited in real codes, because in order to do that you have to know in compile-time a lot of extra info: whenever pointers or dynamic arrays overlap or not, whenever data is aligned or not etc etc. So you shouldn't rely on compiler transformations and specific compiler minor version, but just write HPC-ready code as much as you are capable to.

SpeedUp Matrix Multiplication

I have to do a matrix boolean multiplication of a matrix with itself in a C++ program and I want to optimize it.
The matrix is symmetric so I think to do a row by row multiplication to reduce cache misses.
I allocated space for matrix in this way:
matrix=new bool*[dimension];
for (i=0; i<dimension; i++) {
matrix[i]=new bool[dimension];
}
And the multiplication is the following:
for (m=0; m<dimension; m++) {
for (n=0; n<dimension; n++) {
for (k=0; k<dimension; k++) {
temp=mat[m][k] && mat[n][k];
B[m][n]= B[m][n] || temp;
...
I did some test of computation time with this version and with another version whit a row by column multiplication like this
for (m=0; m<dimension; m++) {
for (n=0; n<dimension; n++) {
for (k=0; k<dimension; k++) {
temp=mat[m][k] && mat[k][n];
B[m][n]= B[m][n] || temp;
...
I did tests on a 1000x1000 matrix The result showed that the second version ( row by column ) is faster the previous one.
Could you show me why? Shouldn't The misses in the first algorithm be less ?
In the first multiplication approach the rows of the boolean matrices are stored consecutively in memory and also accessed consecutively so that prefetching works flawlessly. In the second approach the cacheline fetched when accessing the element (n,0) can already be evicted when accessing (n+1,0). Whether this actually happens depends on the architecture and its cache hierarchy properties you run your code on. On my machine the first approach is indeed faster for large enough matrices.
As for speeding up the computations: Do not use logical operators since they are evaluated lazy and thus branch misprediction can occur. The inner loop can be exited early as soon as B[m][n] becomes true. Instead of using booleans you might want to consider using the bits of say integers. That way you can combine 32 or 64 elements in your inner loop at once and possibly use vectorization. If your matrices are rather sparse then you might want to consider switching to sparse matrix data structures. Also changing the order of the loops can help as well as introducing blocking. However, any performance optimization is specific to an architecture and class of input matrices.
Speeding suggestion. In the inner loop:
Bmn = false;
for (k=0; k<dimension; k++) {
if ((Bmn = mat[m][k] && mat[k][n])) {
k = dimension; // exit for-k loop
}
}
B[m][n]= Bmn

compiler nested loop optimization for sequential memory access.

I came across a strange performance issue in a matrix multiply benchmark (matrix_mult in Metis from the MOSBENCH suite). The benchmark was optimized to tile the data such that the active working set was 12kb (3 tiles of 32x32 ints) and would fit into the L1 cache. To make a long story short, swapping the inner two most loops had a performance difference of almost 4x on certain array input sizes (4096, 8192) and about a 30% difference on others. The problem essentially came down to accessing elements sequentially versus in a stride pattern. Certain array sizes I think created a bad stride access that generated a lot cache line collisions. The performance difference is noticeably less when changing from 2-way associative L1 to an 8-way associative L1.
My question is why doesn't gcc optimize loop ordering to maximize sequential memory accesses?
Below is a simplified version of the problem (note that performance times are highly dependent on L1 configuration. The numbers indicated below are from 2.3 GHZ AMD system with 64K L1 2-way associative compiled with -O3).
N = ARRAY_SIZE // 1024
int* mat_A = (int*)malloc(N*N*sizeof(int));
int* mat_B = (int*)malloc(N*N*sizeof(int));
int* mat_C = (int*)malloc(N*N*sizeof(int));
// Elements of mat_B are accessed in a stride pattern of length N
// This takes 800 msec
for (int t = 0; t < 1000; t++)
for (int a = 0; a < 32; a++)
for (int b = 0; b < 32; b++)
for (int c = 0; c < 32; c++)
mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
// Inner two loops are swapped
// Elements are now accessed sequentially in inner loop
// This takes 172 msec
for (int t = 0; t < 1000; t++)
for (int a = 0; a < 32; a++)
for (int c = 0; c < 32; c++)
for (int b = 0; b < 32; b++)
mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
gcc might not be able to prove that the pointers don't overlap. If you are fine using non standard extensions you could try using __restrict.
gcc doesn't take full advantage of your architecture to avoid the necessity to recompile for every processor. Using the option -march with the appropriate value for your system might help.
gcc has a bunch of optimizations that just do what you want.
Look up the -floop-strip-mine and -floop-block compiler options.
Quote from the manual:
Perform loop blocking transformations on loops. Blocking strip mines
each loop in the loop nest such that the memory accesses of the
element loops fit inside caches. The strip length can be changed using
the loop-block-tile-size parameter.