Both the reduction and collapse clauses in OpenMP confuse me, and a few questions popped into my head:
Why doesn't reduction work with minus, as in the limitation listed here?
Is there any workaround to achieve minus?
How does a unary operator such as x++ or x-- behave? Is the -- or ++ applied to each partial result, or only once when the global result is created? The two cases are totally different.
About collapse: can we apply it to nested loops that have some lines of code in between? For example:
for (int i = 0; i < 4; i++)
{
    cout << "Hi"; // This is an extra line, which breaks up the two loops.
    for (int j = 0; j < 100; j++)
    {
        cout << "*";
    }
}
1 & 2. For minus, what are you subtracting from? With two threads, do you compute result_thread_1 - result_thread_2, or result_thread_2 - result_thread_1? With more than two threads it gets even more confusing: is there only one negative term and all the others positive? Only one positive term and the rest negative? A mix? Which results are which? As such, no, there is no workaround.
In the event of x++ or x--, assuming they appear inside the reduction loop, they are applied to each partial result.
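For instance, a minimal sketch (not from the original answer) with a + reduction: the x++ applies to each thread's private copy, and the partial counts are combined at the end.
#include <stdio.h>

int main(void) {
    int x = 0;
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < 1000; i++) {
        x++;  // increments this thread's private partial result
    }
    printf("%d\n", x);  // 1000: partial counts combined by the + reduction
    return 0;
}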
3. Yes, I believe so.
The reduction clause requires that the operation is associative, and the x = a[i] - x operation in
for(int i=0; i<n; i++) x = a[i] - x;
is not associative. Expand a few iterations to see:
n = 0: x = x0;
n = 1: x = a[0] - x0;
n = 2: x = a[1] - (a[0] - x0)
n = 3: x = a[2] - (a[1] - (a[0] - x0))
= a[2] - a[1] + a[0] - x0;
But x = x - a[i] does work, e.g.
n = 3: x = x0 - (a[2] + a[1] + a[0]);
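A minimal sketch of that direction (my illustration, reusing the setup from the full example further below): since x = x - a[i] is just x0 minus the sum of all a[i], a plain + reduction reproduces it.
#include <stdio.h>

int main(void) {
    int n = 18;
    float x0 = 3;
    float a[18];
    for (int i = 0; i < n; i++) a[i] = i;

    float sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) sum += a[i];

    float x = x0 - sum;  // same result as running x = x - a[i] serially
    printf("%f\n", x);
    return 0;
}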
However, there is a workaround for the x = a[i] - x case as well: the sign alternates from term to term. Here is a working solution.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 18;
    float x0 = 3;
    float a[n];
    for (int i = 0; i < n; i++) a[i] = i;

    // Serial reference: x = a[n-1] - (a[n-2] - (... - x0))
    float x = x0;
    for (int i = 0; i < n; i++) x = a[i] - x;
    printf("%f\n", x);

    int sign = n % 2 == 0 ? -1 : 1;  // sign of a[0] in the expanded sum
    float s = -sign * x0;
    #pragma omp parallel
    {
        float sp = 0;     // per-thread partial sum
        int signp = 1;    // net sign accumulated by this thread's chunk
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) sp += signp * a[i], signp *= -1;
        // Combine the partial sums in thread order, tracking the running sign.
        #pragma omp for schedule(static) ordered
        for (int i = 0; i < omp_get_num_threads(); i++)
            #pragma omp ordered
            s += sign * sp, sign *= signp;
    }
    printf("%f\n", s);
}
Here is a simpler version which uses the reduction clause. The thing to notice is that the odd terms all have one sign and the even terms the other. So if we do the reduction two terms at a time, the sign does not change and the operation is associative.
x = x0;
for(int i=0; i<n; i++) x = a[i] - x;
can be reduced in parallel like this.
x = n%2 ? a[0] - x0 : x0;
#pragma omp parallel for reduction(+:x)
for(int i=0; i<n/2; i++) x += a[2*i+1+n%2] - a[2*i+n%2];
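As a quick sanity check (not part of the original answer), this driver compares the serial recurrence with the pairwise reduction; the two printed values should match.
#include <stdio.h>

int main(void) {
    int n = 18;
    float x0 = 3;
    float a[18];
    for (int i = 0; i < n; i++) a[i] = i;

    float xs = x0;                     // serial reference
    for (int i = 0; i < n; i++) xs = a[i] - xs;

    float x = n % 2 ? a[0] - x0 : x0;  // pairwise parallel version
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < n / 2; i++) x += a[2*i + 1 + n%2] - a[2*i + n%2];

    printf("%f %f\n", xs, x);          // should print the same value twice
    return 0;
}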
The goal is to add OpenMP parallelization to the for (i = 0; i < n; i++) loop of a lower-triangular solver for the system Ax = b. The expected result is exactly the same as when no parallelization is added to that loop.
vector<vector<double>> represents a 2-D matrix. makeMatrix(int m, int n) initializes a vector<vector<double>> of all zeroes of size m x n.
Two of the most prominent attempts have been left in comments.
vector<vector<double>> lowerTriangleSolver(vector<vector<double>> A, vector<vector<double>> b)
{
    vector<vector<double>> x = makeMatrix(A.size(), 1);
    int i, j;
    int n = A.size();
    double s;
    //#pragma omp parallel for reduction(+: s)
    //#pragma omp parallel for shared(s)
    for (i = 0; i < n; i++)
    {
        s = 0.0;
        #pragma omp parallel for
        for (j = 0; j < i; j++)
        {
            s = s + A[i][j] * x[j][0];
        }
        x[i][0] = (b[i][0] - s) / A[i][i];
    }
    return x;
}
You could try to assign the outer loop iterations to threads, instead of the inner loop ones. This increases the granularity of the parallel tasks and avoids the reduction of the s variable.
#pragma omp parallel for
for (int i = 0; i < n; i++){
    double s = 0.0;
    for (int j = 0; j < i; j++){
        s = s + A[i][j] * x[j][0];
    }
    x[i][0] = (b[i][0] - s) / A[i][i];
}
Unfortunately, that is not possible, because there is a dependency between s = s + A[i][j] * x[j][0]; and x[i][0] = (b[i][0] - s) / A[i][i];. More precisely, the x[j][0] read in iteration i was written as x[i][0] by an earlier iteration.
So you can try two approaches:
for (int i = 0; i < n; i++){
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int j = 0; j < i; j++){
        s = s + A[i][j] * x[j][0];
    }
    x[i][0] = (b[i][0] - s) / A[i][i];
}
or using SIMD:
for (int i = 0; i < n; i++){
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (int j = 0; j < i; j++){
        s = s + A[i][j] * x[j][0];
    }
    x[i][0] = (b[i][0] - s) / A[i][i];
}
The time-determining step of my code is a tensor contraction of the following form
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < no; ++i){
    for(int j = 0; j < no; ++j){
        X.middleCols(i*nv,nv) += Y.middleCols(j*nv,nv) * this->getIJMatrix(j,i);
    }
}
where X and Y are large matrices of dimension (nx, no*nv) and the function getIJMatrix(j,i) returns the (nv x nv) matrix for the index pair ij of a rank-four tensor. Also, no < nv << nx. The parallelization here is straightforward. However, I can exploit the symmetry with respect to i and j:
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < no; ++i){
    for(int j = i; j < no; ++j){
        auto ij = this->getIJMatrix(j,i);
        X.middleCols(i*nv,nv) += Y.middleCols(j*nv,nv) * ij;
        if(i!=j) X.middleCols(j*nv,nv) += Y.middleCols(i*nv,nv) * ij.transpose();
    }
}
leaving me with a race condition, since different iterations of the outer loop may now update the same column block of X. Since X is large, using a reduction here is not feasible.
If I understand it correctly, there is no way around each thread waiting for the others within the inner loop. What's a good practice for this, preferably one that is as fast as possible?
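One possible direction (a sketch, not from the original post, assuming the question's X, Y, no, nv, and getIJMatrix): serialize only the conflicting column-block updates with one OpenMP lock per block index, so unrelated updates still proceed in parallel.
// Sketch only: requires <omp.h> and <vector>; assumes the surrounding
// class and Eigen-style matrices from the question.
std::vector<omp_lock_t> locks(no);   // one lock per column block of X
for (int i = 0; i < no; ++i) omp_init_lock(&locks[i]);

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < no; ++i) {
    for (int j = i; j < no; ++j) {
        auto ij = this->getIJMatrix(j, i);
        omp_set_lock(&locks[i]);     // guard block i
        X.middleCols(i*nv, nv) += Y.middleCols(j*nv, nv) * ij;
        omp_unset_lock(&locks[i]);
        if (i != j) {
            omp_set_lock(&locks[j]); // guard block j
            X.middleCols(j*nv, nv) += Y.middleCols(i*nv, nv) * ij.transpose();
            omp_unset_lock(&locks[j]);
        }
    }
}

for (int i = 0; i < no; ++i) omp_destroy_lock(&locks[i]);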
Is there a difference between these two implementations in OpenMP?
float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}
and the same code, but where the pragma doesn't have the shared(sum) clause, because sum is already initialized?
#pragma omp parallel for
for (int i = 0; ....)
Same question for private in OpenMP:
Is
void work(float* a, float* b, float* c, int N)
{
    float x, y;
    int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++)
    {
        x = a[i]; y = b[i];
        c[i] = x + y;
    }
}
the same as without the private(x,y) because x and y aren't initialized?
#pragma omp parallel for
Is there a difference between these two implementations in OpenMP?
float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}
In OpenMP, a variable declared outside the parallel scope is shared unless it is explicitly made private. Hence the shared declaration can be omitted.
But your code is far from optimal. It works, but it will be far slower than its sequential counterpart, because critical forces sequential processing and entering a critical section has a significant cost.
The proper implementation would use a reduction.
float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
The reduction creates a hidden local variable in every thread to accumulate partial sums in parallel; at the end of the loop, these local sums are added atomically to the shared variable sum.
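Roughly, a hand-written sketch of what the clause does (an illustration, not the actual runtime implementation):
float dot_prod_manual(float* a, float* b, int N)
{
    float sum = 0.0f;
    #pragma omp parallel
    {
        float local = 0.0f;        // the "hidden local variable"
        #pragma omp for
        for (int i = 0; i < N; i++)
            local += a[i] * b[i];  // accumulate in parallel, no contention
        #pragma omp atomic
        sum += local;              // one atomic combine per thread
    }
    return sum;
}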
Same question for private in OpenMP:
void work(float* a, float* b, float* c, int N)
{
    float x, y;
    int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++)
    {
        x = a[i]; y = b[i];
        c[i] = x + y;
    }
}
By default, x and y are shared. So without private the behaviour will be different (and buggy, because all threads will modify the same globally accessible variables x and y without any synchronization).
the same as without the private(x,y) because x and y aren't initialized?
Initialization of x and y does not matter; what is important is where they are declared. To ensure proper behaviour, they must be made private, and the code is then correct, since x and y are set before being used in the loop.
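Equivalently, a sketch making the same point: declaring x and y inside the loop body makes them private automatically, so no private clause is needed.
void work(float* a, float* b, float* c, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        float x = a[i], y = b[i];  // loop-local, hence private per thread
        c[i] = x + y;
    }
}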
I have the simple problem of comparing all elements to each other. The comparison itself is symmetric, so it doesn't have to be done twice.
The following code example shows what I am looking for by showing the indices of the accessed elements:
int n = 5;
for (int i = 0; i < n; i++)
{
    for (int j = i + 1; j < n; j++)
    {
        printf("%d %d\n", i, j);
    }
}
The output is:
0 1
0 2
0 3
0 4
1 2
1 3
1 4
2 3
2 4
3 4
So each pair of elements is compared exactly once. When I want to parallelize this code I face two problems: first, I have to stick to dynamic scheduling, because the computation time of each iteration varies to a huge extent, AND I cannot use collapse, because the inner loop's bounds depend on the outer loop index.
Using #pragma omp parallel for schedule(dynamic, 3) on the outer loop may leave only a single core working at the end, whereas using it on the inner loop may lead to such executions within each iteration of the outer loop.
Is there a more sophisticated way of doing/parallelizing that?
I haven't thought it through thoroughly, but you can try an approach like this too:
int total = n * (n-1) / 2; // total number of combinations
#pragma omp parallel for
for (int k = 0; k < total; ++k) {
    int i = first(k, n);
    int j = second(k, n, i);
    printf("%d %d\n", i, j);
}
// Recover the outer index i from the flat index k by walking the
// row lengths n-1, n-2, ...
int first(int k, int n) {
    int i = 0;
    for (; k >= n - 1; ++i) {
        k -= n - 1;
        n -= 1;
    }
    return i;
}

// Recover the inner index j; t is the number of pairs in all rows before row i.
int second(int k, int n, int i) {
    int t = i * (2*n - i - 1) / 2;
    return (t == 0 ? k + i + 1 : (k % t) + i + 1);
}
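As a quick sanity check (not part of the original answer), driving the mapping serially for n = 5 should reproduce the ten pairs listed above:
#include <stdio.h>

int first(int k, int n);
int second(int k, int n, int i);

int main(void) {
    int n = 5;
    int total = n * (n - 1) / 2;
    for (int k = 0; k < total; ++k) {
        int i = first(k, n);
        int j = second(k, n, i);
        printf("%d %d\n", i, j);  // same ten pairs as the nested loops
    }
    return 0;
}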
Indeed, the OpenMP standard says about collapse that:
The iteration count for each associated loop is computed before entry to the outermost loop. If execution of any associated loop changes any of the values used to compute any of the iteration counts, then the behavior is unspecified.
So you cannot collapse your loops, which would have been the easiest way.
However, since you're not particularly interested in the order in which the pairs of indexes are computed, you can change your loops a bit as follows:
for ( int i = 0; i < n; i++ ) {
    for ( int j = 0; j < n / 2; j++ ) {
        int ii, jj;
        if ( j < i ) {
            ii = n - 1 - i;
            jj = n - 1 - j;
        }
        else {
            ii = i;
            jj = j + 1;
        }
        printf( "%d %d\n", ii, jj );
    }
}
This should give you all the pairs you want, in a somewhat mangled order, but with fixed iteration limits that allow for balanced parallelisation, and even loop collapsing if you want. The only catch: if n is even, the column corresponding to n/2 will be displayed twice, so either you live with that or you slightly modify the algorithm to avoid it...
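For instance, a sketch of the collapsed version this rewrite enables (my addition, same loop body as above):
#pragma omp parallel for collapse(2) schedule(dynamic)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n / 2; j++) {
        int ii, jj;
        if (j < i) { ii = n - 1 - i; jj = n - 1 - j; }
        else       { ii = i;         jj = j + 1;     }
        printf("%d %d\n", ii, jj);  // rectangular bounds, so collapse is legal
    }
}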
I have previously had good results with the following:
#pragma omp parallel for collapse(2)
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        if (j <= i)
            continue;
        printf("%d %d\n", i, j);
    }
}
Do remember that printf itself is not a meaningful parallel workload, so it would be best to profile this with your specific work. You could try adding schedule(dynamic, 10) or something greater than 10, depending on how many iterations you're performing.
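Concretely, that suggestion would look like this (a sketch; the chunk size 10 is only a starting point to tune):
#pragma omp parallel for collapse(2) schedule(dynamic, 10)
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        if (j <= i)
            continue;
        printf("%d %d\n", i, j);
    }
}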
I'm wondering if it is feasible to make this loop parallel using OpenMP.
Of course, there is the issue of race conditions. I'm unsure how to deal with the inner loop's indexing depending on n from the outer loop, and with the race condition on D = D + A[n]. Do you think it is practical to try to make this parallel?
for (n = 0; n < 10000000; ++n) {
    for (n2 = 0; n2 < 100; ++n2) {
        A[n] = A[n] + B[n2][n + C[n2] + 200];
    }
    D = D + A[n];
}
Yes, this is indeed parallelizable assuming none of the pointers are aliased.
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n = 0; n < 10000000; ++n) {
    for (n2 = 0; n2 < 100; ++n2) {
        A[n] = A[n] + B[n2][n + C[n2] + 200];
    }
    D += A[n];
}
It could actually be optimized somewhat by accumulating into a local temporary, so that A[n] is read and written only once:
int D = 0; // Or whatever the type is.
#pragma omp parallel for reduction(+:D) private(n2)
for (n = 0; n < 10000000; ++n) {
    int tmp = A[n];  // read A[n] once into a local temporary
    for (n2 = 0; n2 < 100; ++n2) {
        tmp += B[n2][n + C[n2] + 200];
    }
    A[n] = tmp;      // write A[n] back once
    D += tmp;
}