Is there a difference between these two OpenMP implementations?
float dot_prod (float* a, float* b, int N)
{
float sum = 0.0;
#pragma omp parallel for shared(sum)
for (int i = 0; i < N; i++) {
#pragma omp critical
sum += a[i] * b[i];
}
return sum;
}
and the same code, but where the pragma doesn't have the shared(sum) clause, because sum is already initialized?
#pragma omp parallel for
for (int i = 0; ....)
Same question for private in OpenMP:
Is
void work(float* c, int N)
{
float x, y; int i;
#pragma omp parallel for private(x,y)
for (i = 0; i < N; i++)
{
x = a[i]; y = b[i];
c[i] = x + y;
}
}
the same as without the private(x,y) because x and y aren't initialized?
#pragma omp parallel for
Is there a difference between these two OpenMP implementations?
float dot_prod (float* a, float* b, int N)
{
float sum = 0.0;
# pragma omp parallel for shared(sum)
for (int i = 0; i < N; i++) {
#pragma omp critical
sum += a[i] * b[i];
}
return sum;
}
In OpenMP, a variable declared outside the parallel region is shared unless it is explicitly made private.
Hence the shared declaration can be omitted.
But your code is far from optimal. It works, but it will be much slower than its sequential counterpart, because the critical construct forces the updates to run one at a time, and entering a critical section carries a significant cost.
The proper implementation would use a reduction.
float dot_prod (float* a, float* b, int N)
{
float sum = 0.0;
# pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += a[i] * b[i];
}
return sum;
}
The reduction clause creates a hidden local copy of sum in every thread, accumulates into it in parallel, and, at the end of the parallel region, combines these local sums into the shared variable sum (for example with an atomic addition).
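Conceptually, the reduction behaves like the following hand-written equivalent (a minimal sketch of the idea, not the exact code the compiler generates):
float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0f;
    #pragma omp parallel
    {
        float local_sum = 0.0f;          // per-thread accumulator
        #pragma omp for
        for (int i = 0; i < N; i++)
            local_sum += a[i] * b[i];
        // combine the partial sums, once per thread
        #pragma omp atomic
        sum += local_sum;
    }
    return sum;
}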
Same question for private in OpenMP:
void work(float* c, int N)
{
float x, y; int i;
# pragma omp parallel for private(x,y)
for (i = 0; i < N; i++)
{
x = a[i]; y = b[i];
c[i] = x + y;
}
}
By default, x and y are shared, so without private the behaviour will be different (and buggy), because all threads will modify the same shared variables x and y without any synchronization.
the same as without the private(x,y) because x and y aren't initialized?
Initialization of x and y does not matter; what matters is where they are declared. To ensure correct behaviour, they must be made private, and the code is then correct because x and y are set before being used in the loop.
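An equivalent, arguably cleaner alternative is to declare x and y inside the loop body, since variables declared inside the parallel loop are automatically private to each thread. A minimal sketch (here a and b are passed as parameters so the snippet is self-contained, unlike the original fragment):
void work(float* a, float* b, float* c, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        // declared inside the loop body, hence private by construction
        float x = a[i];
        float y = b[i];
        c[i] = x + y;
    }
}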
How do I get better optimization for this piece of code using OpenMP?
The number of threads is 6, but I can't get better performance.
I have tried different scheduling options, but I can't get it optimized any better.
Is there a way of getting a better result?
int length = 40000;
int idx;
long *result = new long[ length ];
#pragma omp parallel for private(idx) schedule(dynamic)
for ( int i = 0; i < length; i++ ) {
for ( int j = 0; j < i; j++ ) {
idx = (int)( someCalculations( i, j ) );
#pragma omp atomic
result[ idx ] += 1;
}
}
This code does reduce the calculation time, but I still need a better result.
Thanks in advance.
Since OpenMP 4.0 you can write your own reduction.
The idea is:
in the for loop, you tell the compiler to reduce the array you modify in each iteration;
since OpenMP doesn't know how to reduce such an array, you must write your own adder my_add, which simply sums two arrays;
you tell OpenMP how to use it in your reduction declaration (myred).
#include <stdio.h>
#include <stdlib.h>
#define LEN 40000
int someCalculations(int i, int j)
{
return i * j % 40000 ;
}
/* simple adder: accumulate y into x, then free y */
long *my_add(long * x, long *y)
{
int i;
#pragma omp parallel for private(i)
for (i = 0; i < LEN; ++i)
{
x[i] += y[i];
}
free(y);
return x;
}
/* reduction declaration:
name
type
operation to be performed
initializer */
#pragma omp declare reduction(myred: long*:omp_out=my_add(omp_out,omp_in))\
initializer(omp_priv=calloc(LEN, sizeof(long)))
int main(void)
{
int i, j;
long *result = calloc(LEN, sizeof *result);
// tell omp how to use it
#pragma omp parallel for reduction(myred:result) private (i, j)
for (i = 0; i < LEN; i++) {
for (j = 0; j < i; j++) {
int idx = someCalculations(i, j);
result[idx] += 1;
}
}
// simple display, I store it in a file and compare
// result files with/without openmp to be sure it's correct...
for (i = 0; i < LEN; ++i) {
printf("%ld\n", result[i]);
}
return 0;
}
Without -fopenmp: real 0m3.727s
With -fopenmp: real 0m0.835s
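If your compiler supports OpenMP 4.5, a shorter alternative is to reduce directly over an array section and let the runtime allocate and combine the per-thread copies for you. This is only a sketch under that assumption, reusing someCalculations and LEN from above in place of the declare reduction machinery and the loop in main:
long *result = calloc(LEN, sizeof *result);
#pragma omp parallel for reduction(+: result[:LEN])
for (int i = 0; i < LEN; i++) {
    for (int j = 0; j < i; j++) {
        result[someCalculations(i, j)] += 1;
    }
}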
I'm working on a factorial function. I have to write a parallel version of it using OpenMP.
double sequentialFactorial(const int N) {
double result = 1;
for(int i = 1; i <= N; i++) {
result *= i;
}
return result;
}
It is well known that this algorithm can be efficiently parallelized using the reduction technique.
I'm aware of the existence of the reduction clause (standard § 2.15.3.6).
double parallelAutomaticFactorial(const int N) {
double result = 1;
#pragma omp parallel for reduction(*:result)
for (int i=1; i <= N; i++)
result *= i;
return result;
}
However, I want to try to implement the reduction technique "by hand".
double parallelHandmadeFactorial(const int N) {
// maximum number of threads
const int N_THREADS = omp_get_max_threads();
// table of partial results
double* partial = new double[N_THREADS];
for(int i = 0; i < N_THREADS; i++) {
partial[i] = 1;
}
// reduction technique
#pragma omp parallel for
for(int i = 1; i <= N; i++) {
int thread_index = omp_get_thread_num();
partial[thread_index] *= i;
}
// fold results
double result = 1;
for(int i = 0; i < N_THREADS; i++) {
result *= partial[i];
}
delete[] partial;
return result;
}
I expect the performance of the last two snippets to be very similar, and better than the first one. However, the average performance is:
Sequential Factorial 3500 ms
Parallel Handmade Factorial 6100 ms
Parallel Automatic Factorial 600 ms
Am I missing something?
Thanks to @Gilles and @P.W, this code works as expected:
double parallelNoWaitFactorial(const int N) {
double result = 1;
#pragma omp parallel
{
double my_local_result = 1;
// removing nowait does not change the performance
#pragma omp for nowait
for(int i = 1; i <= N; i++)
my_local_result *= i;
#pragma omp atomic
result *= my_local_result;
}
return result;
}
If array elements happen to share a cache line, this leads to false sharing, which in turn degrades performance.
To avoid this:
Use a private variable double partial instead of the double array partial.
Use the partial result of each thread to compute the final result in a critical region.
This final result should be a variable that is not private to the parallel region.
The critical region will look like this:
#pragma omp critical
result *= partial;
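Putting this advice together, a minimal sketch of the pattern would look like the following (the function name parallelPrivateFactorial is just for illustration):
double parallelPrivateFactorial(const int N) {
    double result = 1;
    #pragma omp parallel
    {
        // private per-thread accumulator, so there is no false sharing
        double partial = 1;
        #pragma omp for
        for (int i = 1; i <= N; i++)
            partial *= i;
        // combine the per-thread results, once per thread
        #pragma omp critical
        result *= partial;
    }
    return result;
}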
Both the reduction and collapse clauses in OpenMP confuse me;
a few questions popped into my head:
Why doesn't reduction work with minus, as in the limitation listed here?
Is there any workaround to achieve minus?
How does a unary operator work, i.e. x++ or x--? Is the -- or ++ applied to each partial result, or only once when the global result is created? The two cases are totally different.
About collapse:
Can we apply collapse to nested loops that have some lines of code in between them?
For example:
for (int i = 0; i < 4; i++)
{
cout << "Hi"; //This is an extra line. which breaks the 2 loops.
for (int j = 0; j < 100; j++)
{
cout << "*";
}
}
1 & 2. For minus, what are you subtracting from? If you have two threads, do you do result_thread_1 - result_thread_2, or result_thread_2 - result_thread_1? If you have more than 2 threads, then it gets even more confusing: Do I only have one negative term and all others are positive? Is there only one positive term and others are negative? Is it a mix? Which results are which? As such, no, there is no workaround.
In the event of x++ or x--, assuming that they are within the reduction loop, they should happen to each partial result.
Actually, no: for collapse the loops must be perfectly nested, so the cout << "Hi"; between the two loop headers is not allowed (most compilers will reject it). You would have to move that statement inside the inner loop or otherwise restructure the code; only very recent OpenMP revisions relax this restriction.
The reduction clause requires that the operation is associative and the x = a[i] - x operation in
for(int i=0; i<n; i++) x = a[i] - x;
is not associative. Try a few iterations.
n = 0: x = x0;
n = 1: x = a[0] - x0;
n = 2: x = a[1] - (a[0] - x0)
n = 3: x = a[2] - (a[1] - (a[0] - x0))
= a[2] - a[1] + a[0] - x0;
But x = x - a[i] does work e.g.
n = 3: x = x0 - (a[2] + a[1] + a[0]);
However, there is a workaround: the sign alternates from term to term. Here is a working solution.
#include <stdio.h>
#include <omp.h>
int main(void) {
int n = 18;
float x0 = 3;
float a[n];
for(int i=0; i<n; i++) a[i] = i;
float x = x0;
for(int i=0; i<n; i++) x = a[i] - x; printf("%f\n", x);
int sign = n%2== 0 ? -1 : 1 ;
float s = -sign*x0;
#pragma omp parallel
{
float sp = 0;
int signp = 1;
#pragma omp for schedule(static)
for(int i=0; i<n; i++) sp += signp*a[i], signp *= -1;
#pragma omp for schedule(static) ordered
for(int i=0; i<omp_get_num_threads(); i++)
#pragma omp ordered
s += sign*sp, sign *= signp;
}
printf("%f\n", s);
}
Here is a simpler version which uses the reduction clause. The thing to notice is that the odd terms are all one sign and the even terms another. So if we do the reduction two terms at a time the sign does not change and the operation is associative.
x = x0;
for(int i=0; i<n; i++) x = a[i] - x;
can be reduced in parallel like this.
x = n%2 ? a[0] - x0 : x0;
#pragma omp parallel for reduction (+:x)
for(int i=0; i<n/2; i++) x += a[2*i+1+n%2] - a[2*i+n%2];
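As a quick sanity check, here is a small sketch (reusing the same n, x0, and a[] setup as the earlier example) that compares the pairwise reduction against the sequential recurrence; compile with -fopenmp:
#include <stdio.h>

int main(void) {
    int n = 18;
    float x0 = 3;
    float a[n];
    for (int i = 0; i < n; i++) a[i] = i;

    // sequential reference recurrence
    float xs = x0;
    for (int i = 0; i < n; i++) xs = a[i] - xs;

    // pairwise reduction: each pair a[2k+1] - a[2k] keeps a fixed sign
    float x = n % 2 ? a[0] - x0 : x0;
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < n / 2; i++) x += a[2*i + 1 + n%2] - a[2*i + n%2];

    printf("sequential %f, reduction %f\n", xs, x);   // the two should match
    return 0;
}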
I have the following loop that I'd like to accelerate using #pragma omp simd:
#define N 1024
double* data = new double[N];
// Generate data, not important how.
double mean = 0.0;
for (size_t i = 0; i < N; i++) {
mean += (data[i] - mean) / (i+1);
}
As I expected, just putting #pragma omp simd directly before the loop has no impact (I'm examining running times). I can tackle the multi-threaded case easily enough using #pragma omp parallel for reduction(...) with a custom reducer as shown below, but how do I put OpenMP SIMD to use here?
I'm using the following class for implementing the + and += operators for adding a double to a running mean as well as combining two running means:
class RunningMean {
private:
double mean;
size_t count;
public:
RunningMean(): mean(0), count(0) {}
RunningMean(double m, size_t c): mean(m), count(c) {}
RunningMean operator+(RunningMean& rhs) {
size_t c = this->count + rhs.count;
double m = (this->mean*this->count + rhs.mean*rhs.count) / c;
return RunningMean(m, c);
}
RunningMean operator+(double rhs) {
size_t c = this->count + 1;
double m = this->mean + (rhs - this->mean) / c;
return RunningMean(m, c);
}
RunningMean& operator+=(const RunningMean& rhs) {
this->mean = this->mean*this->count + rhs.mean*rhs.count;
this->count += rhs.count;
this->mean /= this->count;
return *this;
}
RunningMean& operator+=(double rhs) {
this->count++;
this->mean += (rhs - this->mean) / this->count;
return *this;
}
double getMean() { return mean; }
size_t getCount() { return count; }
};
The maths for this comes from http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf.
For multi-threaded, non-SIMD parallel reduction I do the following:
#pragma omp declare reduction (runningmean : RunningMean : omp_out += omp_in)
RunningMean mean;
#pragma omp parallel for reduction(runningmean:mean)
for (size_t i = 0; i < N; i++)
mean += data[i];
This gives me a 3.2X speedup on my Core i7 2600k using 8 threads.
If I was to implement the SIMD myself without OpenMP, I would just maintain 4 means in a vector, 4 counts in another vector (assuming the use of AVX instructions) and keep on adding 4-element double precision vectors using a vectorised version of operator+(double rhs). Once that is done, I would add the resulting 4 pairs of means and counts using the maths from operator+=. How can I instruct OpenMP to do this?
The problem is that
mean += (data[i] - mean) / (i+1);
is not easily amenable to SIMD. However, by studying the math carefully it's possible to vectorize it without too much effort.
The key formula is
mean(n+m) = (n*mean(n) + m*mean(m))/(n+m)
which shows how to combine the mean of n numbers with the mean of m numbers. This can be seen in your operator definition RunningMean operator+(RunningMean& rhs), and it explains why your parallel code works. I think this is clearer if we unpack your C++ code:
double mean = 0.0;
int count = 0;
#pragma omp parallel
{
double mean_private = 0.0;
int count_private = 0;
#pragma omp for nowait
for(size_t i=0; i<N; i++) {
count_private ++;
mean_private += (data[i] - mean_private)/count_private;
}
#pragma omp critical
{
mean = (count_private*mean_private + count*mean);
count += count_private;
mean /= count;
}
}
But we can use the same idea with SIMD (and then combine the two). Let's first do the SIMD-only part. Using AVX we can handle four partial means at once. Each partial mean will handle the following data elements:
mean 1 data elements: 0, 4, 8, 12,...
mean 2 data elements: 1, 5, 9, 13,...
mean 3 data elements: 2, 6, 10, 14,...
mean 4 data elements: 3, 7, 11, 15,...
Once we have looped over all the elements, we add the four partial means together and divide by four (since each partial mean runs over N/4 elements).
Here is the code to do this:
double mean = 0.0;
__m256d mean4 = _mm256_set1_pd(0.0);
__m256d count4 = _mm256_set1_pd(0.0);
for(size_t i=0; i<N/4; i++) {
count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
__m256d t1 = _mm256_loadu_pd(&data[4*i]);
__m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
mean4 = _mm256_add_pd(t2, mean4);
}
__m256d t1 = _mm256_hadd_pd(mean4,mean4);
__m128d t2 = _mm256_extractf128_pd(t1,1);
__m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
mean = _mm_cvtsd_f64(t3)/4;
int count = 0;
double mean2 = 0;
for(size_t i=4*(N/4); i<N; i++) {
count++;
mean2 += (data[i] - mean2)/count;
}
mean = (4*(N/4)*mean + count*mean2)/N;
Finally, we can combine this with OpenMP to get the full benefit of SIMD and MIMD like this
double mean = 0.0;
int count = 0;
#pragma omp parallel
{
double mean_private = 0.0;
int count_private = 0;
__m256d mean4 = _mm256_set1_pd(0.0);
__m256d count4 = _mm256_set1_pd(0.0);
#pragma omp for nowait
for(size_t i=0; i<N/4; i++) {
count_private++;
count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
__m256d t1 = _mm256_loadu_pd(&data[4*i]);
__m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
mean4 = _mm256_add_pd(t2, mean4);
}
__m256d t1 = _mm256_hadd_pd(mean4,mean4);
__m128d t2 = _mm256_extractf128_pd(t1,1);
__m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
mean_private = _mm_cvtsd_f64(t3)/4;
#pragma omp critical
{
mean = (count_private*mean_private + count*mean);
count += count_private;
mean /= count;
}
}
int count2 = 0;
double mean2 = 0;
for(size_t i=4*(N/4); i<N; i++) {
count2++;
mean2 += (data[i] - mean2)/count2;
}
mean = (4*(N/4)*mean + count2*mean2)/N;
And here is a working example (compile with -O3 -mavx -fopenmp)
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
double mean_simd(double *data, const int N) {
double mean = 0.0;
__m256d mean4 = _mm256_set1_pd(0.0);
__m256d count4 = _mm256_set1_pd(0.0);
for(size_t i=0; i<N/4; i++) {
count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
__m256d t1 = _mm256_loadu_pd(&data[4*i]);
__m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
mean4 = _mm256_add_pd(t2, mean4);
}
__m256d t1 = _mm256_hadd_pd(mean4,mean4);
__m128d t2 = _mm256_extractf128_pd(t1,1);
__m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
mean = _mm_cvtsd_f64(t3)/4;
int count = 0;
double mean2 = 0;
for(size_t i=4*(N/4); i<N; i++) {
count++;
mean2 += (data[i] - mean2)/count;
}
mean = (4*(N/4)*mean + count*mean2)/N;
return mean;
}
double mean_simd_omp(double *data, const int N) {
double mean = 0.0;
int count = 0;
#pragma omp parallel
{
double mean_private = 0.0;
int count_private = 0;
__m256d mean4 = _mm256_set1_pd(0.0);
__m256d count4 = _mm256_set1_pd(0.0);
#pragma omp for nowait
for(size_t i=0; i<N/4; i++) {
count_private++;
count4 = _mm256_add_pd(count4,_mm256_set1_pd(1.0));
__m256d t1 = _mm256_loadu_pd(&data[4*i]);
__m256d t2 = _mm256_div_pd(_mm256_sub_pd(t1, mean4), count4);
mean4 = _mm256_add_pd(t2, mean4);
}
__m256d t1 = _mm256_hadd_pd(mean4,mean4);
__m128d t2 = _mm256_extractf128_pd(t1,1);
__m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);
mean_private = _mm_cvtsd_f64(t3)/4;
#pragma omp critical
{
mean = (count_private*mean_private + count*mean);
count += count_private;
mean /= count;
}
}
int count2 = 0;
double mean2 = 0;
for(size_t i=4*(N/4); i<N; i++) {
count2++;
mean2 += (data[i] - mean2)/count2;
}
mean = (4*(N/4)*mean + count2*mean2)/N;
return mean;
}
int main() {
const int N = 1001;
double data[N];
for(int i=0; i<N; i++) data[i] = 1.0*rand()/RAND_MAX;
float sum = 0; for(int i=0; i<N; i++) sum+= data[i]; sum/=N;
printf("mean %f\n", sum);
printf("mean_simd %f\n", mean_simd(data, N);
printf("mean_simd_omp %f\n", mean_simd_omp(data, N));
}
The KISS answer: Just calculate the mean outside the loop. Parallelize the following code:
double sum = 0.0;
for(size_t i = 0; i < N; i++) sum += data[i];
double mean = sum/N;
The sum is easily parallelizable, but you won't see any effect from SIMD optimization: it is purely memory bound, and the CPU will just be waiting for data from memory. If N is as small as 1024, there is little point in parallelizing at all; the synchronization overhead will eat up all the gains.
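For completeness, a minimal sketch of that parallel sum using the built-in + reduction (parallel for simd assumes OpenMP 4.0 or later; for small N you may want to drop the parallel part and keep only simd):
double sum = 0.0;
#pragma omp parallel for simd reduction(+:sum)
for (size_t i = 0; i < N; i++)
    sum += data[i];
double mean = sum / N;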
If I use nested parallel for loops like this:
#pragma omp parallel for schedule(dynamic,1)
for (int x = 0; x < x_max; ++x) {
#pragma omp parallel for schedule(dynamic,1)
for (int y = 0; y < y_max; ++y) {
//parallelize this code here
}
//IMPORTANT: no code in here
}
is this equivalent to:
for (int x = 0; x < x_max; ++x) {
#pragma omp parallel for schedule(dynamic,1)
for (int y = 0; y < y_max; ++y) {
//parallelize this code here
}
//IMPORTANT: no code in here
}
Is the outer parallel for doing anything other than creating a new task?
If your compiler supports OpenMP 3.0, you can use the collapse clause:
#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (int x = 0; x < x_max; ++x) {
for (int y = 0; y < y_max; ++y) {
//parallelize this code here
}
//IMPORTANT: no code in here
}
If it doesn't (e.g. only OpenMP 2.5 is supported), there is a simple workaround:
#pragma omp parallel for schedule(dynamic,1)
for (int xy = 0; xy < x_max*y_max; ++xy) {
int x = xy / y_max;
int y = xy % y_max;
//parallelize this code here
}
You can enable nested parallelism with omp_set_nested(1); and your nested omp parallel for code will work but that might not be the best idea.
By the way, why the dynamic scheduling? Is every loop iteration evaluated in non-constant time?
NO.
The first #pragma omp parallel will create a team of parallel threads, and the second will then try to create, for each of the original threads, another team, i.e. a team of teams. However, on almost all existing implementations the second team has only one thread, so the second parallel region is essentially unused. Thus, your code is roughly equivalent to
#pragma omp parallel for schedule(dynamic,1)
for (int x = 0; x < x_max; ++x) {
// only one x per thread
for (int y = 0; y < y_max; ++y) {
// code here: each thread loops all y
}
}
If you don't want that, but only parallelise the inner loop, you can do this:
#pragma omp parallel
for (int x = 0; x < x_max; ++x) {
// each thread loops over all x
#pragma omp for schedule(dynamic,1)
for (int y = 0; y < y_max; ++y) {
// code here, only one y per thread
}
}