C++ Auto-Vectorize Matrix Multiplication loop

When compiling my source code, which does basic matrix-matrix multiplication, with auto-vectorization and auto-parallelization enabled, I receive these warnings in the console:
C5002: loop not vectorized due to reason '1200'
C5012: loop not parallelized due to reason '1000'
I've read through this resource provided by MSDN which states:
Reason code 1200: Loop contains loop-carried data dependences that prevent vectorization. Different iterations of the loop interfere with each other such that vectorizing the loop would produce wrong answers, and the auto-vectorizer cannot prove to itself that there are no such data dependences.
Reason code 1000: The compiler detected a data dependency in the loop body.
I'm not sure what in my loop is causing problems. Here is the relevant portion of my source code.
// int** A, int** B, int** result, const int dimension
for (int i = 0; i < dimension; ++i) {
    for (int j = 0; j < dimension; ++j) {
        for (int k = 0; k < dimension; ++k) {
            result[i][j] = result[i][j] + A[i][k] * B[k][j];
        }
    }
}
Any insight would be greatly appreciated.

The loop-carried dependence is on result[i][j].
A solution to your problem would be to use a temporary variable when summing up the result and to do the update outside the inner-most loop, like this:
for (int i = 0; i < dimension; ++i) {
    for (int j = 0; j < dimension; ++j) {
        auto tmp = 0;
        for (int k = 0; k < dimension; ++k) {
            tmp += A[i][k] * B[k][j];
        }
        result[i][j] = tmp;
    }
}
This removes the dependence (there is no longer a read-after-write of result[i][j] in the inner loop) and should help the vectorizer do a better job.
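For reference, here is a self-contained sketch of the fixed kernel, reusing the signature from the comment in the question (the function name is only illustrative):
// Sketch: the question's signature with the accumulator fix applied.
// The inner k-loop now reduces into a local scalar, so the only store to
// result[i][j] happens once per (i, j), after the k-loop finishes.
void multiply(int** A, int** B, int** result, const int dimension)
{
    for (int i = 0; i < dimension; ++i) {
        for (int j = 0; j < dimension; ++j) {
            int tmp = 0;                      // private accumulator per (i, j)
            for (int k = 0; k < dimension; ++k) {
                tmp += A[i][k] * B[k][j];
            }
            result[i][j] = tmp;               // single write-back after the k-loop
        }
    }
}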

Related

Efficient, parallel tensor contraction with vectorized data

The time-determining step of my code is a tensor contraction of the following form:
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < no; ++i){
    for(int j = 0; j < no; ++j){
        X.middleCols(i*nv,nv) += Y.middleCols(j*nv,nv) * this->getIJMatrix(j,i);
    }
}
where X and Y are large matrices of dimension (nx, no*nv) and the function getIJMatrix(j,i) returns the (nv*nv) matrix for the index pair ij of a rank-four tensor. Also, no < nv << nx. The parallelization here is straightforward. However, I can exploit symmetry with respect to i and j:
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < no; ++i){
    for(int j = i; j < no; ++j){
        auto ij = this->getIJMatrix(j,i);
        X.middleCols(i*nv,nv) += Y.middleCols(j*nv,nv) * ij;
        if(i!=j) X.middleCols(j*nv,nv) += Y.middleCols(i*nv,nv) * ij.transpose();
    }
}
leaving me with a race condition. Since X is large, using a reduction here is not feasible.
If I understand it correctly, there is no way around each thread waiting for the other ones within the inner loop. What's a good practice for this which preferably is as fast as possible?
edit: corrected obvious errors
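One common pattern for this kind of update race, sketched below rather than taken from the thread, is to give each nv-wide column block of X its own lock so that threads only serialize when they touch the same block. The sketch assumes Eigen matrices and uses a std::function stand-in for the poster's getIJMatrix:
#include <Eigen/Dense>
#include <functional>
#include <omp.h>
#include <vector>

// Sketch only: one omp_lock_t per nv-wide column block of X. Updates to
// different blocks run concurrently; only updates to the same block serialize.
// getIJMatrix is a stand-in for the poster's member function.
void contract(Eigen::MatrixXd& X, const Eigen::MatrixXd& Y, int no, int nv,
              const std::function<Eigen::MatrixXd(int, int)>& getIJMatrix)
{
    std::vector<omp_lock_t> locks(no);
    for (auto& l : locks) omp_init_lock(&l);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < no; ++i) {
        for (int j = i; j < no; ++j) {
            Eigen::MatrixXd ij = getIJMatrix(j, i);

            omp_set_lock(&locks[i]);
            X.middleCols(i * nv, nv) += Y.middleCols(j * nv, nv) * ij;
            omp_unset_lock(&locks[i]);

            if (i != j) {
                omp_set_lock(&locks[j]);
                X.middleCols(j * nv, nv) += Y.middleCols(i * nv, nv) * ij.transpose();
                omp_unset_lock(&locks[j]);
            }
        }
    }

    for (auto& l : locks) omp_destroy_lock(&l);
}
Each thread holds at most one lock at a time, so there is no deadlock risk, and since no is usually larger than the thread count, contention on any single block should stay moderate.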

Difference between the several ways to parallelize nested for loops in C, C++ using OpenMP

I've just started studying parallel programming with OpenMP, and there is a subtle point about nested loops. I wrote a simple matrix multiplication code and checked that the result is correct. But there are actually several ways to parallelize this for loop, which may differ in low-level detail, and I want to ask about it.
At first, I wrote the code below, which multiplies two matrices A and B and assigns the result to C.
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
It works, but it takes a really long time. And I found out that, because of the location of the parallel directive, it constructs the parallel region N² times. I noticed this by the huge increase in user time when I used the Linux time command.
Next, I tried the code below, which also worked.
#pragma omp parallel for private(i, j, k, sum)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
The elapsed time decreased from 72.720 s for sequential execution to 5.782 s for parallel execution with the code above, which is a reasonable result because I ran it on 16 cores.
But the flow of the second code is not easily drawn in my mind. I know that if we privatize all loop variables, the program will consider that nested loop as one large loop of size N³. This can easily be checked by executing the code below.
#pragma omp parallel for private(i, j, k)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        for(k = 0; k < N; k++)
        {
            printf("%d, %d, %d\n", i, j, k);
        }
    }
}
The printf was executed N³ times.
But in my second matrix multiplication code, there are statements involving sum right before and after the innermost loop, and that makes it hard for me to unfold the loop in my mind. The third code I wrote is easy to unfold in my mind.
To summarize, I want to know what really happens behind the scenes in my second matrix multiplication code, especially with the change of the value of sum. I would also be thankful for recommendations of tools to observe the flow of a multithreaded program written with OpenMP.
omp for by default only applies to the next immediate loop. The inner loops are not affected at all. This means you can think about your second version like this:
// Example for two threads
with one thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = 0; i < N / 2; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
with the other thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = N / 2; i < N; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
You can simplify all reasoning about variables with OpenMP by declaring them as locally as possible, i.e. instead of the explicit private declaration use:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        int sum = 0;
        for(int k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
This way you can see the private scope of the variables more easily.
In some cases it can be beneficial to apply parallelism to multiple loops.
This is done by using collapse, i.e.
#pragma omp parallel for collapse(2)
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
You can imagine this works with a transformation like:
#pragma omp parallel for
for (int ij = 0; ij < N * N; ij++)
{
    int i = ij / N;
    int j = ij % N;
A collapse(3) would not work for this loop because of the sum = 0 in-between.
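As an illustration of why the perfect nesting matters (this variant is mine, not part of the original answer), collapse(3) only becomes possible once nothing sits between the loops. That forces the scalar accumulator out and makes an atomic update necessary, because the same (i, j) element can then be split across threads along k:
// Illustration only: C must be zeroed beforehand, and this is usually slower
// than the collapse(2) form because of the atomics.
#pragma omp parallel for collapse(3)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
        {
            #pragma omp atomic
            C[i][j] += A[i][k] * B[k][j];
        }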
Now there is one more detail:
#pragma omp parallel for
is a shorthand for
#pragma omp parallel
#pragma omp for
The first creates the threads; the second shares the work of a loop among all threads that reach this point. This may not be important for understanding right now, but there are use cases for which it matters. For instance, you could write:
#pragma omp parallel
for(int i = 0; i < N; i++)
{
    #pragma omp for
    for(int j = 0; j < N; j++)
    {
        // ... inner loop body, work-shared among the team ...
    }
}
I hope this sheds some light on what happens there from a logical point of view.

C++ vectorize double loop

I would like to vectorize a double for loop with omp simd. My problem is of the following form:
#include <vector>
using namespace std;

#define N 8000

int main() {
    vector<int> a;
    vector<int> b;
    vector<int> c;
    a.resize(N);
    b.resize(N);
    c.resize(N);
    #pragma omp simd collapse(2)
    for (unsigned int i = 0; i < c.size(); ++i) {
        for (unsigned int j = 0; j < c.size(); ++j) {
            c[i] += a[i] + b[j];
        }
    }
}
When I compile this with g++ -O2 -fopenmp-simd -fopt-info-vec-all the vectorization report states:
note: not vectorized: not suitable for gather load _14 = *_42;
How can the code be transformed for the compiler to auto-vectorize it?
(Compiler: g++ 5.4.0, CPU supports AVX2)
UPDATE
The main problem is, as mentioned below, a data dependency on c, whereby only the inner loop seems to be vectorizable. The dependency can be resolved by switching the loops as seen below. The compiler now auto-vectorizes this for me.
for (unsigned int j = 0; j < c.size(); ++j) {
    #pragma omp simd
    for (unsigned int i = 0; i < c.size(); ++i) {
        c[i] += a[i] + b[j];
    }
}
The main problem with your code is that the loop iteration count cannot be computed before executing the loop; you need to replace c.size() with N.
The second problem is that, if you want to vectorize the outer loop, the statement c[i] = a[i] + b[j] leads to flow and anti dependencies. To overcome these issues I tried to vectorize the inner loop, and my code was successfully vectorized.
You can read more about anti and flow dependencies on the page below:
https://en.wikipedia.org/wiki/Data_dependency
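For a concrete picture of the two kinds of dependence (these two tiny loops are my illustration, not the answer's code, reusing the vector a and the size N from the question):
// Flow (read-after-write) dependence: iteration i reads the value that
// iteration i-1 just wrote, so iterations cannot safely run out of order.
for (unsigned int i = 1; i < N; ++i)
    a[i] = a[i - 1] + 1;

// Anti (write-after-read) dependence: iteration i reads a[i + 1], which
// iteration i+1 then overwrites; running iterations out of order changes
// the values read unless the originals are copied first.
for (unsigned int i = 0; i + 1 < N; ++i)
    a[i] = a[i + 1] + 1;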
I achieved a 6.3x speedup after vectorization.
Finally, my code looks like this:
for (unsigned int i = 0; i < N; ++i)
{
    #pragma simd
    for (unsigned int j = 0; j < N; ++j)
    {
        c[i] = a[i] + b[j];
    }
}

Identifying the Slow Down for Matrix Multiplication Algorithms

First, I understand that there are many algorithms for matrix multiplication. For this question, I will be considering the following:
Algorithm A:
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        prod[i][j] = 0;
        for(k = 0; k < N; k++)
            prod[i][j] += mat1[i][k] * mat2[k][j];
    }
}
Algorithm B:
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        temp = 0;
        for(k = 0; k < N; k++)
            temp += mat1[i][k] * mat2[k][j];
        prod[i][j] = temp;
    }
}
A coworker and I both agree that algorithm B outperforms (has faster run times than) algorithm A. However, we have different reasoning for why B performs better.
The first theory is that because we replace the prod[i][j] accesses inside the innermost loop, we perform N³ fewer pointer-arithmetic operations, thus causing our improvement.
The second theory is that because we replace the prod[i][j] accesses inside the innermost loop, we perform N³ fewer memory accesses, thus causing our improvement.
The question is:
How do I set up an experiment to isolate each of these factors and decide empirically which, if either, has the larger effect on run time?
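One way to separate the two effects, assuming int matrices and keeping the question's naming (the intermediate variant A2 below is my addition, not from the question), is to introduce a version that hoists the address computation of prod[i][j] out of the k-loop but still loads and stores the element every iteration. Timing A against A2 then isolates the repeated pointer arithmetic, and A2 against B isolates the repeated memory accesses. Check the generated assembly (or lower the optimization level), since an optimizing compiler may already perform either transformation on its own.
Algorithm A2 (hypothetical intermediate):
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        int *p = &prod[i][j];   /* address computed once per (i, j) */
        *p = 0;
        for(k = 0; k < N; k++)
            *p += mat1[i][k] * mat2[k][j];   /* still one load and one store per k */
    }
}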

Is it possible to parallelize this for loop?

I was given some code to parallelize using OpenMP and, among the various function calls, I noticed this for loop takes a good chunk of the computation time.
double U[n][n];
double L[n][n];
double Aprime[n][n];

for(i=0; i<n; i++) {
    for(j=0; j<n; j++) {
        if (j <= i) {
            double s;
            s=0;
            for(k=0; k<j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        } else if (j >= i) {
            double s;
            s=0;
            for(k=0; k<i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}
However, after trying to parallelize it and applying some semaphores here and there (with no luck), I came to the realization that the else if branch has a strong dependency on the earlier if branch (L[j][i] is computed using U[i][i], which may be set in the earlier if), making it, in my opinion, non-parallelizable due to race conditions.
Is it possible to parallelize this code in such a manner to make the else if only be executed if the earlier if has already completed?
Before trying to parallelize things, try simplification first.
For example, the if can be completely eliminated.
Also, the code is accessing the matrixes in a way that causes the worst cache performance. That may be the real bottleneck.
Note: In update #3 below, I did benchmarks and the cache friendly version fix5, from update #2, outperforms the original by 3.9x.
I've cleaned things up in stages, so you can see the code transformations.
With this, it should be possible to add omp directives successfully. As I mentioned in my top comment, the global vs. function scope of the variables affects the type of update that may be required (e.g. omp atomic update, etc.)
For reference, here is your original code:
double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        if (j <= i) {
            double s;
            s = 0;
            for (k = 0; k < j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        }
        else if (j >= i) {
            double s;
            s = 0;
            for (k = 0; k < i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}
The else if (j >= i) was unnecessary and could be replaced with just else. But, we can split the j loop into two loops so that neither needs an if/else:
// fix2.c -- split up j's loop to eliminate if/else inside

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / U[i][i];
    }
}
U[i][i] is invariant in the second j loop, so we can presave it:
// fix3.c -- save off value of U[i][i]

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }
    double Uii = U[i][i];
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / Uii;
    }
}
The access to the matrixes is done in probably the worst way for cache performance. So, if the assignment of dimensions can be flipped, a substantial savings in memory access can be achieved:
// fix4.c -- transpose matrix coordinates to get _much_ better memory/cache
// performance

double U[n][n];
double L[n][n];
double Aprime[n][n];

for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[k][j] * U[i][k];
        U[i][j] = Aprime[i][j] - s;
    }
    double Uii = U[i][i];
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[k][j] * U[i][k];
        L[i][j] = (Aprime[i][j] - s) / Uii;
    }
}
UPDATE:
In the OP's first k-loop it's k<j and in the 2nd it's k<i. Don't you have to fix that?
Yes, I've fixed it. It was too ugly a change for fix1.c, so I removed that and applied the changes to fix2-fix4 where it was easy to do.
UPDATE #2:
These variables are all local to the function.
If you mean they are function scoped [without static], this says that the matrixes can't be too large because, unless the code increases the stack size, they're limited to the stack size limit (e.g. 8 MB)
Although the matrixes appeared to be VLAs [because n was lowercase], I ignored that. You may want to try a test case using fixed dimension arrays as I believe they may be faster.
Also, if the matrixes are function scope, and want to parallelize things, you'd probably need to do (e.g.) #pragma omp shared(Aprime) shared(U) shared(L).
The biggest drag on cache performance was the loops that calculate s. In fix4, I was able to make access to U cache friendly, but access to L was poor.
I'd need to post a whole lot more if I did include the external context
I guessed as much, so I did the matrix dimension swap speculatively, not knowing how much other code would need changing.
I've created a new version that changes the dimensions on L back to the original way, but keeping the swapped versions on the other ones. This provides the best cache performance for all matrixes. That is, the inner loop for most matrix access is such that each iteration is incrementing along the cache lines.
In fact, give it a try. It may improve things to the point where parallel isn't needed. I suspect the code is memory bound anyway, so parallel might not help as much.
// fix5.c -- further transpose to fix poor performance on s calc loops
//
// flip the L dimensions back to original

double U[n][n];
double L[n][n];
double Aprime[n][n];
double *Up;
double *Lp;
double *Ap;

for (i = 0; i < n; i++) {
    Ap = Aprime[i];
    Up = U[i];
    for (j = 0; j <= i; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < j; k++)
            s += Lp[k] * Up[k];
        Up[j] = Ap[j] - s;
    }
    double Uii = Up[i];
    for (; j < n; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < i; k++)
            s += Lp[k] * Up[k];
        Lp[i] = (Ap[j] - s) / Uii;
    }
}
Even if you really need the original dimensions, depending upon the other code, you might be able to transpose going in and transpose back going out. This would keep things the same for other code, but, if this code is truly a bottleneck, the extra transpose operations might be small enough to merit this.
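If OpenMP is still wanted on top of fix5, one possible sketch (mine, not benchmarked here) is to work-share only the second j loop: within one outer i iteration, its iterations read only values produced by earlier i iterations and by the first j loop, and each writes a distinct L[j][i]. The first j loop has to stay sequential because U[i][j] depends on U[i][k] for k < j. The shared j counter is replaced by a loop-local index so the loop has the canonical form OpenMP needs:
for (i = 0; i < n; i++) {
    Ap = Aprime[i];
    Up = U[i];
    for (j = 0; j <= i; j++) {              // sequential: carries a dependence on U[i][k]
        double s = 0;
        Lp = L[j];
        for (k = 0; k < j; k++)
            s += Lp[k] * Up[k];
        Up[j] = Ap[j] - s;
    }
    double Uii = Up[i];
    #pragma omp parallel for schedule(static)
    for (int j2 = i + 1; j2 < n; j2++) {    // independent rows of L for this i
        double s = 0;
        double *Lp2 = L[j2];
        for (int k2 = 0; k2 < i; k2++)
            s += Lp2[k2] * Up[k2];
        Lp2[i] = (Ap[j2] - s) / Uii;
    }
}
Re-entering the parallel region for every i has its own overhead, and since the code is likely memory bound anyway, the gain from this may be limited.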
UPDATE #3:
I've run benchmarks on all the versions. Here are the elapsed times and ratios relative to original for n equal to 1037:
orig: 1.780916929 1.000x
fix1: 3.730602026 0.477x
fix2: 1.743769884 1.021x
fix3: 1.765769482 1.009x
fix4: 1.762100697 1.011x
fix5: 0.452481270 3.936x
Higher ratios are better.
Anyway, this is the limit of what I can do. So, good luck ...