I have a school task about parallel programming and I'm having a lot of problems with it.
My task is to create a parallel version of a given matrix multiplication code and test its performance (and yes, it has to be in KIJ order):
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
This is what I came up with so far:
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
#pragma omp parallel
{
#pragma omp for schedule(static, 16)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
And that's where I found something confusing. This parallel version of the code runs around 50% slower than the non-parallel one. The difference in speed varies only a little with the matrix size (tested SIZE = 128, 256, 512, 1024, 2048, and various schedule versions - dynamic, static, without it at all, etc. so far).
Can someone help me understand what I am doing wrong? Is it maybe because I'm using the KIJ order and it won't get any faster with OpenMP?
EDIT:
I'm working on a Windows 7 PC, using Visual Studio 2015 Community edition, compiling in Release x86 mode (x64 doesn't help either). My CPU is an Intel Core i5-2520M @ 2.50GHz (yes, yes, it's a laptop, but I'm getting the same results on my home i7 PC).
I'm using global arrays:
float matrix_a[SIZE][SIZE];
float matrix_b[SIZE][SIZE];
float matrix_r[SIZE][SIZE];
I'm assigning random (float) values to matrices a and b; matrix r is filled with 0s.
I've tested the code with various matrix sizes so far (128, 256, 512, 1024, 2048, etc.). Some of them are intentionally too large to fit in cache.
My current version of code looks like this:
void multiply_matrices_KIJ()
{
#pragma omp parallel
{
for (int k = 0; k < SIZE; k++) {
#pragma omp for schedule(dynamic, 16) nowait
for (int i = 0; i < SIZE; i++) {
for (int j = 0; j < SIZE; j++) {
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
}
}
And just to be clear, I know that with a different ordering of the loops I could get better results, but that is the thing - I HAVE TO use the KIJ order. My task is to run the KIJ for loops in parallel and check the performance increase. My problem is that I expect(ed) at least somewhat faster execution (than what I'm getting now, which is between 5-10% faster at most) even though it's the I loop that runs in parallel (I can't do that with the K loop because I would get an incorrect result, since it's matrix_r[i][j] being updated).
These are the results I'm getting when using the code shown above (I'm doing calculations hundreds of times and getting the average time):
SIZE = 128
Serial version:                    0.000608s
Parallel I, schedule(dynamic, 16): 0.000683s
Parallel I, schedule(static, 16):  0.000647s
Parallel J, no schedule:           0.001978s (this is where I expected way slower execution)

SIZE = 256
Serial version:                    0.005787s
Parallel I, schedule(dynamic, 16): 0.005125s
Parallel I, schedule(static, 16):  0.004938s
Parallel J, no schedule:           0.013916s

SIZE = 1024
Serial version:                    0.930250s
Parallel I, schedule(dynamic, 16): 0.865750s
Parallel I, schedule(static, 16):  0.823750s
Parallel J, no schedule:           1.137000s
Note: This answer is not about squeezing the best performance out of your given loop order or about parallelizing it as-is, because I consider that order suboptimal for several reasons. Instead, I'll give some advice on how to improve the order (and then parallelize it).
Loop order
OpenMP is usually used to distribute work across several CPU cores. Therefore, you want to maximize the workload of each thread while minimizing the amount of required data and information transfer.
You want to execute the outermost loop in parallel instead of the second one, and you'll want one of the r_matrix indices as the outer loop index in order to avoid race conditions when writing to the result matrix.
The next thing is that you want to traverse the matrices in memory storage order (having the faster-changing index as the second, not the first, subscript).
You can achieve both with the following loop/index order:
for i = 0 to a_rows
    for k = 0 to a_cols
        for j = 0 to b_cols
            r[i][j] += a[i][k]*b[k][j]
where
- j changes faster than i or k, and k changes faster than i
- i is a result-matrix subscript, so the i loop can run in parallel
Rearranging your multiply_matrices_KIJ in that way gives quite a bit of a performance boost already.
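For illustration, applied directly to the global arrays from the question, the rearranged order could look like the following sketch (the parallelization of the i loop is discussed in more detail below):

void multiply_matrices_IKJ()
{
    // Sketch only: I-K-J order; the outer i loop can be parallelized because
    // each thread then owns a distinct set of rows of matrix_r.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < SIZE; i++)
        for (int k = 0; k < SIZE; k++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}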
I did some short tests and the code I used to compare the timings is:
template<class T>
void mm_kij(T const * const matrix_a, std::size_t const a_rows,
std::size_t const a_cols, T const * const matrix_b, std::size_t const b_rows,
std::size_t const b_cols, T * const matrix_r)
{
for (std::size_t k = 0; k < a_cols; k++)
{
for (std::size_t i = 0; i < a_rows; i++)
{
for (std::size_t j = 0; j < b_cols; j++)
{
matrix_r[i*b_cols + j] +=
matrix_a[i*a_cols + k] * matrix_b[k*b_cols + j];
}
}
}
}
mimicking your multiply_matrices_KIJ() function, versus
template<class T>
void mm_opt(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
for (std::size_t i = 0; i < a_rows; ++i)
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < a_cols; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
}
implementing the above-mentioned order.
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k
mm_kij(): 6.16706s.
mm_opt(): 2.6567s.
The given order also allows for outer loop parallelization without introducing any race conditions when writing to the result matrix:
template<class T>
void mm_opt_par(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
#if defined(_OPENMP)
#pragma omp parallel
{
auto ar = static_cast<std::ptrdiff_t>(a_rows);
#pragma omp for schedule(static) nowait
for (std::ptrdiff_t i = 0; i < ar; ++i)
#else
for (std::size_t i = 0; i < a_rows; ++i)
#endif
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < b_rows; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
#if defined(_OPENMP)
}
#endif
}
Here, each thread writes to its own set of result rows.
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k (4 OMP threads)
mm_kij(): 6.16706s.
mm_opt(): 2.6567s.
mm_opt_par(): 0.968325s.
Not perfect scaling, but as a start it is already faster than the serial code.
OpenMP implementations create a thread pool (although a thread pool is not mandated by the OpenMP standard, every implementation of OpenMP I have seen does this), so that threads don't have to be created and destroyed each time a parallel region is entered. Nevertheless, there is a barrier between parallel regions, so all threads have to synchronize. There is probably some additional overhead in the fork-join model between parallel regions. So even though the threads don't have to be recreated, they still have to be initialized between parallel regions. More details can be found here.
In order to avoid the overhead of repeatedly entering parallel regions, I suggest creating the parallel region around the outermost loop but doing the work sharing on the inner loop over i, like this:
void multiply_matrices_KIJ() {
#pragma omp parallel
for (int k = 0; k < SIZE; k++)
#pragma omp for schedule(static) nowait
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
There is an implicit barrier when using #pragma omp for. The nowait clause removes the barrier.
Also make sure you compile with optimization enabled. There is little point in comparing performance without it. I would use -O3.
Always keep in mind that for caching purposes, the best ordering of your loops goes from the slowest-changing index (outermost) to the fastest-changing one (innermost). In your case, that means I,K,J order. I would be quite surprised if your serial code were not automatically reordered from KIJ to IKJ by your compiler (assuming you have "-O3"). However, the compiler cannot do this with your parallel loop because that would break the logic you are declaring within your parallel region.
If you really cannot reorder your loops, then your best bet would probably be to rewrite the parallel region to encompass the largest possible loop. If you have OpenMP 4.0, you could also consider utilizing SIMD vectorization across your fastest dimension. However, I am still doubtful you will be able to beat your serial code by much, because of the aforementioned caching issues inherent in your KIJ ordering...
void multiply_matrices_KIJ()
{
    #pragma omp parallel
    for (int k = 0; k < SIZE; k++)
    {
        // Work-sharing goes on the i loop: parallelizing k itself would let two
        // threads update the same matrix_r[i][j] concurrently (a data race).
        #pragma omp for schedule(static)
        for (int i = 0; i < SIZE; i++)
        {
            #pragma omp simd
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
    }
}
Related
I am writing a simple N-body program in C++ and I am using OpenMP to speed up the computations. At some point, I have nested loops that look like this:
int N;
double* S = new double[N];
double* Weight = new double[N];
double* Coordinate = new double[N];
...
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
S[i] += K*Weight[j];
S[j] -= K*Weight[i];
}
}
The issue here is that I do not obtain exactly the same result when removing the #pragma. I am guessing it has to do with the fact that the second loop depends on the integer i, but I don't see how to get past that issue.
The problem is that there is a data race when updating S[i] and S[j]. Different threads may read from/write to the same element of the array at the same time; therefore the updates should be atomic operations (you have to add #pragma omp atomic) to avoid the data race and to ensure memory consistency:
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
#pragma omp atomic
S[i] += K*Weight[j];
#pragma omp atomic
S[j] -= K*Weight[i];
}
I have an issue with parallelizing two for loops with OpenMP in C++. I have a member function CallFunction(i,j) which, for every i and j, sets independent member variables to specific values and returns a weighted sum of these values. Because these computations are independent for different combinations of i and j, I want to parallelize this process. I tried it in the following way:
double optimal_value = 0;
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
if(i == j) continue;
optimal_value += CallFunction(i,j);
}
}
The above code does not have a significant effect on my runtime; I achieve almost the same runtime with and without "#pragma omp parallel for". Would it be better to write the nested loop as one loop and parallelize it? I have no idea how to make that work. Do I need further directives or settings besides enabling OpenMP?
My system has a dual-core CPU.
Could you please help me do this right?
Many thanks in advance!
Here is the parallelization of the two loops:
double optimal_value = 0;
int num_tr = 1;
double begin = omp_get_wtime();
#pragma omp parallel for reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
    num_tr = omp_get_num_threads();
    double optimal_value_in = 0.0;
    // Note: this inner region only spawns extra threads if nested parallelism
    // is enabled (omp_set_nested(1) / OMP_NESTED=true); otherwise it simply
    // runs on the enclosing thread, which is usually fine here.
    #pragma omp parallel for reduction(+:optimal_value_in)
    for (int j = 0; j < n; j++)
    {
        if (i == j) continue;
        optimal_value_in += CallFunction(i,j);
    }
    optimal_value += optimal_value_in;
}
double end = omp_get_wtime();
double elapsed_secs = double(end - begin);
cout << "############# " << "Using #Threads " << num_tr << endl;
cout << "############# " << optimal_value << " Time For Parallel Execution :: " << elapsed_secs << endl;
The thing here (as also mentioned in the comments above) is that I am not sure you will see any speedup with just n = 25 and a CallFunction body as simple as
double CallFunction(int i, int j){
return i*j;
}
With n = 250000 and 8 threads, I got a speedup of 4.43, so it will strongly depend on what is done in CallFunction.
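Regarding the "write the nested loop as one loop" part of the question: the collapse clause (OpenMP 3.0 and later) merges both loops into a single iteration space before distributing it, which can help when the outer loop alone has too few iterations for the available threads. A sketch:

double optimal_value = 0;
// Both loops are perfectly nested, so collapse(2) is allowed here.
#pragma omp parallel for collapse(2) reduction(+:optimal_value)
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
    {
        if (i == j) continue;
        optimal_value += CallFunction(i, j);
    }
}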
I have this code:
scalar State::add(const int N, const int M,
vector<scalar>& flmn,
vector<scalar>& BSum,
const vector<scalar>& prev_flm,
const vector<scalar>& prev_bigsum,
const vector<scalar>& Qratio,
const int test)
{
scalar c=1;
#pragma omp parallel for
for(int i=1;i<=M;i++)
{
flmn.at(i-1) = Qratio.at(i-1)*k1+k2;
BSum.at(i-1) = someconstant + somepublicvector.at(1)*flmn.at(i-1);
c *= BSum.at(i-1);
}
return c;
}
At the end I am returning the variable c. When I use "#pragma omp parallel for", it definitely won't give me a consistent answer, since there is always an overlap between the iterations. I wonder how such a combination of vector manipulations should be parallelized in OpenMP so that I also get consistent results, as there is obviously a race condition here.
for (int i = 1; i <= M; i++) {
flmn.at(i - 1) = Qratio.at(i - 1) * k1 + k2;
BSum.at(i - 1) = someconstant + somepublicvector.at(1) * flmn.at(i - 1);
c *= BSum.at(i - 1);
}
A few notes:
Don't use std::vector::at unless you really need the exception-safe indexing.
You are using the same index for each vector, so start at i=0 rather than the Fortran-style i=1.
Is M different from the sizes of the vectors being used (i.e., is it a subset)? If not, then it doesn't need to be specified.
A possible OpenMP implementation could then be
scalar c{1.0};
// The reduction is attached to the parallel region itself; each thread then
// multiplies its own chunk into a private copy of c, combined at the end.
#pragma omp parallel reduction(*:c)
{
    const std::size_t nthreads = omp_get_num_threads();
    const std::size_t chunk_size = M / nthreads; // WARNING: non-even division case left to user
    const std::size_t tid = omp_get_thread_num();
    for (std::size_t j = 0; j < chunk_size; j++) {
        const std::size_t i = j + tid * chunk_size;
        flmn[i] = Qratio[i] * k1 + k2;
        BSum[i] = someconstant + somepublicvector[1] * flmn[i];
        c *= BSum[i];
    }
}
Note that I have assumed that nthreads evenly divides M. If it does not, this case needs to be handled separately. If you are using OpenMP 4.0, then I recommend using the simd directive since the first two lines are both saxpy operations and can benefit from vectorization. For optimal performance, make sure that chunk_size is a multiple of your CPU's cacheline size.
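If the exact chunking is not important, a simpler variant is to let the omp for work-sharing and the reduction clause do the partitioning. This is only a sketch: k1, k2, someconstant and somepublicvector are assumed to be in scope as in the question, and scalar is assumed to be an arithmetic type so that reduction(*) applies:

scalar c{1.0};
#pragma omp parallel for reduction(*:c)
for (int i = 0; i < M; ++i)
{
    flmn[i] = Qratio[i] * k1 + k2;
    BSum[i] = someconstant + somepublicvector[1] * flmn[i];
    c *= BSum[i];   // per-thread partial products are combined by the reduction clause
}

Both variants avoid the race on c because each thread only multiplies into its own private copy.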
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
XY[2*nind] = i;
XY[2*nind + 1] = j;
nind++;
}
}
}
Here x = 512, z = 512, and nind = 0 initially, and XY is declared as XY[2*x*z].
I want to optimize these for loops with OpenMP, but the 'nind' variable is tightly bound to the serial execution of the loops. I have no clue how to proceed, because I am also checking a condition, so sometimes the if branch is not entered and the increment is skipped, and sometimes it is entered and nind is incremented. With OpenMP, whichever thread comes first would increment nind first. Is there any way to unbind it? (By 'binding' I mean that it can only be implemented serially.)
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
uint myXY[2*z*x];
uint mynind = 0;
#pragma omp for collapse(2) schedule(dynamic,N)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
myXY[2*mynind] = i;
myXY[2*mynind + 1] = j;
mynind++;
}
}
}
#pragma omp critical(concat_arrays)
{
memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
nind += mynind;
}
}
// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
// qsort expects a comparator that takes const void* arguments
int compar(const void *pa, const void *pb)
{
    const uint *p1 = (const uint *)pa;
    const uint *p2 = (const uint *)pb;
    if (p1[0] < p2[0])
        return -1;
    else if (p1[0] > p2[0])
        return 1;
    else
    {
        if (p1[1] < p2[1])
            return -1;
        else if (p1[1] > p2[1])
            return 1;
    }
    return 0;
}
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
Here is a variation on Hristo Iliev's good answer.
The key quantity to work with here is the index of each pair rather than the pair itself.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(int *a, int *b, int*c, int na, int nb) {
int i=0, j=0, k=0;
while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i<na) c[k++] = a[i++];
while(j<nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P = NULL;  // start empty so that the first merge and free are well defined
#pragma omp parallel
{
uint myP[x*z];
uint mynind = 0;
#pragma omp for schedule(dynamic) nowait
for(uint k = 0 ; k < x*z; k++) {
if (inFunc(p, index)) myP[mynind++] = k;
}
#pragma omp critical
{
uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
merge(P, myP, t, nind, mynind);
free(P);
P = t;
nind += mynind;
}
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it goes at O(omp_get_num_threads()) but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations assigned to each thread increase monotonically. I think in practice they do, but it is not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically you can implement dynamic scheduling by hand.
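For completeness, a hand-rolled dynamic schedule can be built around an atomic chunk counter so that each thread grabs strictly increasing chunk start indices. A sketch (total would be x*z here, and the chunk size is just a tuning knob):

int next = 0;                 // shared counter of the next unclaimed iteration
const int chunk = 1024;       // chunk size, to be tuned
#pragma omp parallel
{
    for (;;) {
        int start;
        #pragma omp atomic capture
        { start = next; next += chunk; }    // claim the next chunk atomically
        if (start >= total) break;
        const int end = (start + chunk < total) ? start + chunk : total;
        for (int k = start; k < end; ++k) {
            // process iteration k, e.g. test inFunc and record k as in the code above
        }
    }
}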
The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job, as threads (in the best case) should avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also what do you want to achieve by optimizing it? The x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution, map your indices to fixed positions in the array, e.g. (not tested, but should work):
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            uint idx = 2 * (i * z + j);  // every (i,j) pair gets its own fixed slot
            XY[idx] = i;
            XY[idx + 1] = j;
        }
    }
}
However, you will then have gaps in your array XY, which may or may not be a problem for you.
I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can. I am using OpenMP for this task. The problem is that I don't have very much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
int size = 3;
#pragma omp parallel for schedule (static)
for(int i = 0; i < opt.FVIc.outerSize(); i++)
{
int index = 3*i;
Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
{
int face = it.row();
for(int n = 0; n < size; n++)
{
Qxyz.row(n) += N(face,n)*N.row(face);
elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
}
}
for(int n = 0; n < size; n++)
{
for(int k = 0; k < size; k++)
{
elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
}
}
}
#pragma omp parallel for schedule (static)
for(int j = 0; j < opt.VFIc.outerSize(); j++)
{
elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
{
int index = 3*it.row();
for(int n = 0; n < size; n++)
{
elements.push_back(T(offset+j,index+n,N(j,n)));
}
}
}
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
ConstraintsManager manager;
SurfaceConstraint surface(1,true);
PlanarizationConstraint planarization(1,true,3^Nv,Nf);
manager.addConstraint(&surface);
manager.addConstraint(&planarization);
double mu = mu0;
for(int k = 0; k < iterations; k++)
{
#pragma omp parallel for schedule (static)
for(int j = 0; j < VFIc.outerSize(); j++)
{
manager.calcVariableMatrix(*this,j);
}
#pragma omp parallel for schedule (static)
for(int i = 0; i < FVIc.outerSize(); i++)
{
Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
manager.addLocalMatrixComponent(*this,i,A,b,mu);
Eigen::VectorXd temp = b.transpose();
Q.row(i) = A.colPivHouseholderQr().solve(temp);
}
mu = r*mu;
}
return Q;
}
My question is: what makes one function work so well with the omp directive, and what makes the other function crash? What is the difference that makes the omp directive act differently?
Before using OpenMP, you pushed data into the vector elements one element at a time. With OpenMP, however, several threads run the code in the for loop in parallel. When more than one thread pushes data into the vector elements at the same time, and there is no code to ensure that one thread does not start pushing before another one finishes, problems occur. That's why your code crashes.
To solve this problem, you could use local buffer vectors: each thread first pushes data into its own private buffer vector, and then the buffers are concatenated into a single vector.
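A minimal, self-contained sketch of that pattern (the element type and the filter below are placeholders, not the actual Maya code):

#include <vector>

int main()
{
    const int n = 100000;
    std::vector<int> elements;                 // shared output vector
    #pragma omp parallel
    {
        std::vector<int> local;                // per-thread private buffer
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; ++i)
        {
            if (i % 3 == 0)                    // placeholder for the real work/filter
                local.push_back(i);
        }
        #pragma omp critical                   // only one thread appends at a time
        elements.insert(elements.end(), local.begin(), local.end());
    }
    return 0;
}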
Note that this method cannot maintain the original order of the data elements in the vector elements. If you need that, you could compute the expected index of each data element and write it directly to the right position.
Update:
OpenMP provides APIs to let you know how many threads you use and which thread you are using. See omp_get_max_threads() and omp_get_thread_num() for more info.
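A tiny illustration of those two query functions:

#include <omp.h>
#include <cstdio>

int main()
{
    // Upper bound on the team size of the next parallel region
    std::printf("max threads: %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        // Each thread reports its own id within the team
        std::printf("hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}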