I'm working on a factorial function and have to write a parallel version of it using OpenMP.
double sequentialFactorial(const int N) {
    double result = 1;
    for(int i = 1; i <= N; i++) {
        result *= i;
    }
    return result;
}
It is well known that this algorithm can be efficiently parallelized using the reduction technique.
I'm aware of the existence of the reduction clause (standard § 2.15.3.6).
double parallelAutomaticFactorial(const int N) {
    double result = 1;
    #pragma omp parallel for reduction(*:result)
    for (int i = 1; i <= N; i++)
        result *= i;
    return result;
}
However, I want to try to implement the reduction technique by hand.
double parallelHandmadeFactorial(const int N) {
    // maximum number of threads
    const int N_THREADS = omp_get_max_threads();
    // table of partial results
    double* partial = new double[N_THREADS];
    for(int i = 0; i < N_THREADS; i++) {
        partial[i] = 1;
    }
    // reduction technique
    #pragma omp parallel for
    for(int i = 1; i <= N; i++) {
        int thread_index = omp_get_thread_num();
        partial[thread_index] *= i;
    }
    // fold results
    double result = 1;
    for(int i = 0; i < N_THREADS; i++) {
        result *= partial[i];
    }
    delete[] partial;
    return result;
}
I expect the performance of the last two snippets to be very similar, and better than the first one. However, the average performance is:
Sequential Factorial: 3500 ms
Parallel Handmade Factorial: 6100 ms
Parallel Automatic Factorial: 600 ms
Am I missing something?
Thanks to @Gilles and @P.W, this code works as expected:
double parallelNoWaitFactorial(const int N) {
    double result = 1;
    #pragma omp parallel
    {
        double my_local_result = 1;
        // removing nowait does not change the performance
        #pragma omp for nowait
        for(int i = 1; i <= N; i++)
            my_local_result *= i;
        #pragma omp atomic
        result *= my_local_result;
    }
    return result;
}
If array elements happen to share a cache line, this leads to false sharing, which in turn degrades performance.
To avoid this:
Use a private variable double partial instead of the double array partial.
Use the partial result of each thread to compute the final result in a critical region.
This final result should be a variable that is not private to the parallel region.
The critical region will look like this:
#pragma omp critical
result *= partial;
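For completeness, the array-based approach can also be salvaged by padding each thread's slot so that no two slots share a cache line. The following is only a sketch, assuming a 64-byte cache line and the same omp_get_max_threads()/omp_get_thread_num() pattern as above; the private-variable version described above is usually the cleaner fix:

double parallelPaddedFactorial(const int N) {
    const int N_THREADS = omp_get_max_threads();
    const int STRIDE = 8; // 8 doubles = 64 bytes, the assumed cache-line size

    // give every thread its own cache line in the table of partial results
    double* partial = new double[N_THREADS * STRIDE];
    for(int t = 0; t < N_THREADS; t++)
        partial[t * STRIDE] = 1;

    #pragma omp parallel for
    for(int i = 1; i <= N; i++)
        partial[omp_get_thread_num() * STRIDE] *= i;

    double result = 1;
    for(int t = 0; t < N_THREADS; t++)
        result *= partial[t * STRIDE];

    delete[] partial;
    return result;
}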
I've just started studying parallel programming with OpenMP, and there is a subtle point about nested loops. I wrote a simple matrix multiplication code and checked that the result is correct. But actually there are several ways to parallelize this for loop, which may differ in low-level detail, and I want to ask about them.
At first, I wrote the code below, which multiplies two matrices A and B and assigns the result to C.
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
It works, but it takes a really long time. I found out that because of the location of the parallel directive, the parallel region is constructed N² times. I noticed this from the huge increase in user time reported by the Linux time command.
Next, I tried the code below, which also worked.
#pragma omp parallel for private(i, j, k, sum)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
The elapsed time decreased from 72.720 s for sequential execution to 5.782 s for parallel execution with the code above, which is a reasonable result because I executed it with 16 cores.
But the flow of the second code is not easy to picture in my mind. I know that if we privatize all loop variables, the program will consider that nested loop as one large loop of size N³. This can easily be checked by executing the code below.
#pragma omp parallel for private(i, j, k)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        for(k = 0; k < N; k++)
        {
            printf("%d, %d, %d\n", i, j, k);
        }
    }
}
The printf was executed N³ times.
But in my second matrix multiplication code, there is sum right before and after the innermost loop, and that makes it hard for me to unfold the loop in my mind. The third code I wrote is easy to unfold in my mind.
To summarize, I want to know what really happens behind the scenes in my second matrix multiplication code, especially with how the value of sum changes. I would also really appreciate recommendations of tools for observing the flow of a multithreaded program written with OpenMP.
omp for by default only applies to the next directly nested loop. The inner loops are not affected at all. This means you can think of your second version like this:
// Example for two threads
with one thread execute
{
    // declare private variables "locally"
    int i, j, k;
    for(i = 0; i < N / 2; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
with the other thread execute
{
    // declare private variables "locally"
    int i, j, k;
    for(i = N / 2; i < N; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
You can simplify all reasoning about variables with OpenMP by declaring them as locally as possible, i.e. instead of the explicit private declaration use:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        int sum = 0;
        for(int k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
This way you can see the private scope of a variable more easily.
In some cases it can be beneficial to apply parallelism across multiple loops.
This is done with the collapse clause, i.e.
#pragma omp parallel for collapse(2)
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
You can imagine this works with a transformation like:
#pragma omp parallel for
for (int ij = 0; ij < N * N; ij++)
{
    int i = ij / N;
    int j = ij % N;
A collapse(3) would not work for this loop because of the sum = 0 in-between.
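To illustrate: collapse requires the loops to be perfectly nested, so the per-(i, j) accumulator would have to go. A minimal sketch of one way collapse(3) could be made legal, assuming C is zero-initialized beforehand (the atomic update is needed because iterations over k for the same (i, j) may run on different threads, which is also why this variant is usually slower in practice):

#pragma omp parallel for collapse(3)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
        {
            #pragma omp atomic
            C[i][j] += A[i][k] * B[k][j];
        }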
Now for one more detail:
#pragma omp parallel for
is a shorthand for
#pragma omp parallel
#pragma omp for
The first creates the threads; the second shares the work of a loop among all threads reaching this point. This may not seem important for the understanding now, but there are use-cases for which it matters. For instance you could write:
#pragma omp parallel
for(int i = 0; i < N; i++)
{
    #pragma omp for
    for(int j = 0; j < N; j++)
    {
        // the thread team is created only once and reused for every i;
        // only the j loop is divided among the threads
    }
}
I hope this sheds some light on what happens there from a logical point of view.
How do I get better optimization for this piece of code using OpenMP?
The number of threads is 6, but I can't get better performance.
I have tried different scheduling options, but I can't get it optimized any further.
Is there a way of getting a better result?
int length = 40000;
int idx;
long *result = new long[ length ];
#pragma omp parallel for private(idx) schedule(dynamic)
for ( int i = 0; i < length; i++ ) {
    for ( int j = 0; j < i; j++ ) {
        idx = (int)( someCalculations( i, j ) );
        #pragma omp atomic
        result[ idx ] += 1;
    }
}
This piece of code does improve the calculation time, but I still need a better result.
Thanks in advance.
Since OpenMP 4.0 you can write your own reduction.
The idea is:
in the for loop, you tell the compiler to reduce the variable you modify in each iteration;
since OpenMP doesn't know how to reduce such an array, you must write your own adder my_add, which simply sums two arrays;
you tell OpenMP how to use it in your reduction declaration (myred).
#include <stdio.h>
#include <stdlib.h>

#define LEN 40000

int someCalculations(int i, int j)
{
    return i * j % 40000;
}

/* simple adder, just sum x+y into x */
long *my_add(long *x, long *y)
{
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < LEN; ++i)
    {
        x[i] += y[i];
    }
    free(y);
    return x;
}

/* reduction declaration:
   name
   type
   operation to be performed
   initializer */
#pragma omp declare reduction(myred: long*:omp_out=my_add(omp_out,omp_in))\
    initializer(omp_priv=calloc(LEN, sizeof(long)))

int main(void)
{
    int i, j;
    long *result = calloc(LEN, sizeof *result);

    // tell omp how to use it
    #pragma omp parallel for reduction(myred:result) private (i, j)
    for (i = 0; i < LEN; i++) {
        for (j = 0; j < i; j++) {
            int idx = someCalculations(i, j);
            result[idx] += 1;
        }
    }

    // simple display, I store it in a file and compare
    // result files with/without openmp to be sure it's correct...
    for (i = 0; i < LEN; ++i) {
        printf("%ld\n", result[i]);
    }
    return 0;
}
Without -fopenmp: real 0m3.727s
With -fopenmp: real 0m0.835s
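As a side note (not part of the original answer): if your compiler supports OpenMP 4.5, the same effect can be had without a user-defined reduction by reducing over an array section directly. A minimal sketch, assuming result points to LEN zero-initialized longs:

/* OpenMP 4.5 array-section reduction: each thread gets a private,
   zero-initialized copy of result[0..LEN-1]; the copies are summed at the end. */
#pragma omp parallel for reduction(+:result[0:LEN])
for (int i = 0; i < LEN; i++)
    for (int j = 0; j < i; j++)
        result[someCalculations(i, j)] += 1;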
Is there a difference between these two OpenMP implementations?
float dot_prod (float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}
And the same code, but where line 4 does not have the shared(sum), because sum is already initialized?
#pragma omp parallel for
for (int i = 0; ....)
Same question for private in OpenMP:
Is
void work(float* c, int N)
{
    float x, y; int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++)
    {
        x = a[i]; y = b[i];
        c[i] = x + y;
    }
}
the same as without the private(x,y) because x and y aren't initialized?
#pragma omp parallel for
Is there a difference between these two OpenMP implementations?
float dot_prod (float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}
In OpenMP, a variable declared outside the parallel region is shared unless it is explicitly made private.
Hence the shared declaration can be omitted.
But your code is far from optimal. It works, but it will be much slower than its sequential counterpart, because critical forces sequential processing and entering a critical section has a significant cost.
The proper implementation would use a reduction.
float dot_prod (float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
The reduction creates a hidden local variable in every thread to accumulate into in parallel and, before the threads are destroyed, atomically adds these local sums to the shared variable sum.
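Conceptually (this is not the literal code the compiler generates, just an equivalent sketch), the reduction behaves roughly like this hand-written version:

float dot_prod (float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel
    {
        float local_sum = 0.0;   // the "hidden" per-thread accumulator
        #pragma omp for
        for (int i = 0; i < N; i++)
            local_sum += a[i] * b[i];
        #pragma omp atomic       // one combine per thread, not one per iteration
        sum += local_sum;
    }
    return sum;
}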
Same question for private in OpenMP:
void work(float* c, int N)
{
    float x, y; int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++)
    {
        x = a[i]; y = b[i];
        c[i] = x + y;
    }
}
By default, x and y are shared. So without private the behaviour will be different (and buggy, because all threads modify the same globally accessible variables x and y without atomic access).
the same as without the private(x,y) because x and y aren't initialized?
Initialization of x and y does not matter; what is important is where they are declared. To ensure proper behavior, they must be made private, and the code is then correct because x and y are set before being used in the loop.
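Following the same advice as in the matrix-multiplication answer above, the private clause can be avoided entirely by declaring x and y inside the loop. A minimal sketch, assuming a and b are visible arrays as in the original snippet:

void work(float* c, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        // declared inside the loop body, so every iteration (and thread)
        // gets its own x and y: no private clause needed
        float x = a[i];
        float y = b[i];
        c[i] = x + y;
    }
}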
I wrote code to test the performance of OpenMP on Windows (Win7 x64, Core i7 3.4 GHz) and on a Mac (10.12.3, Core i7 2.7 GHz).
In Xcode I made a console application with the default compiler settings. I use LLVM 3.7 and OpenMP 5 (in omp.h I found #define KMP_VERSION_MAJOR 5, #define KMP_VERSION_MINOR 0 and KMP_VERSION_BUILD 20150701, libiomp5) on macOS 10.12.3 (CPU: Core i7, 2.7 GHz).
For Windows I use VS2010 SP1. Additionally, I set C/C++ -> Optimization -> Optimization = Maximize Speed (/O2) and C/C++ -> Optimization -> Favor Size Or Speed = Favor Fast Code (/Ot).
If I run the application in a single thread, the time difference roughly corresponds to the frequency ratio of the processors. But if I run 4 threads, the difference becomes dramatic: the Windows program is ~70 times faster than the Mac program.
#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>

static double ActionWithNumber(double number)
{
    double sum = 0.0f;
    for (std::uint32_t i = 0; i < 50; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    return sum;
}

static double TestOpenMP(void)
{
    const std::uint32_t len = 4000000;
    double *a;
    double *b;
    double *c;
    double sum = 0.0;
    std::mutex _mutex;

    a = new double[len];
    b = new double[len];
    c = new double[len];

    for (std::uint32_t i = 0; i < len; i++)
    {
        c[i] = 0.0;
        a[i] = sin((double)i);
        b[i] = cos((double)i);
    }

    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();

    double k = 2.0;
    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < len; i++)
    {
        c[i] = k*a[i] + b[i] + k;
        if (c[i] > 0.0)
        {
            c[i] += ActionWithNumber(c[i]);
        }
        else
        {
            c[i] -= ActionWithNumber(c[i]);
        }
        std::lock_guard<std::mutex> scoped(_mutex);
        sum += c[i];
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;

    double sum2 = 0.0;
    for (std::uint32_t i = 0; i < len; i++)
    {
        sum2 += c[i];
        c[i] /= sum2;
    }
    if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");

    delete[] a;
    delete[] b;
    delete[] c;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const std::uint32_t steps = 5;
    for (std::uint32_t i = 0; i < steps; i++)
    {
        sum += TestOpenMP();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
I specifically use a mutex here to compare the performance of OpenMP on the Mac and on Windows. On Windows the function returns a time of 0.39 seconds. On the Mac the function returns a time of 25 seconds, i.e. 70 times slower.
What is the cause of this difference?
First of all, thanks for editing my post (I use a translator to write the text).
In the real app, I update the values in a huge matrix (20000x20000) in random order. Each thread determines the new value and writes it into a particular cell. I create a mutex for each row, since in most cases different threads write to different rows. But apparently, when two threads write to the same row, there is a long lock. At the moment I can't split the rows across threads, since the order of writes is determined by the FEM elements.
So simply putting a critical section in there is not an option, as it would block writes to the entire matrix.
I wrote code similar to the real application.
// Same includes as above, plus Boost.Thread for boost::mutex / boost::lock_guard
// and <cstdlib> for rand()
#include <boost/thread.hpp>
#include <cstdlib>

typedef unsigned int u32;

static double ActionWithNumber(double number)
{
    const unsigned int steps = 5000;
    double sum = 0.0f;
    for (u32 i = 0; i < steps; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    sum /= (double)steps;
    return sum;
}

static double RealAppTest(void)
{
    const unsigned int elementsNum = 10000;
    double* matrix;
    unsigned int* elements;
    boost::mutex* mutexes;

    elements = new unsigned int[elementsNum*3];
    matrix = new double[elementsNum*elementsNum];
    mutexes = new boost::mutex[elementsNum];

    for (unsigned int i = 0; i < elementsNum; i++)
        for (unsigned int j = 0; j < elementsNum; j++)
            matrix[i*elementsNum + j] = (double)(rand() % 100);

    for (unsigned int i = 0; i < elementsNum; i++) // build FEM element like Triangle
    {
        elements[3*i]   = rand()%(elementsNum-1);
        elements[3*i+1] = rand()%(elementsNum-1);
        elements[3*i+2] = rand()%(elementsNum-1);
    }

    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();

    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < elementsNum; i++)
    {
        unsigned int* elems = &elements[3*i];
        for (unsigned int j = 0; j < 3; j++)
        {
            // in here set mutex for row with index = elems[j];
            boost::lock_guard<boost::mutex> lockup(mutexes[i]);
            double res = 0.0;
            for (unsigned int k = 0; k < 3; k++)
            {
                res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
            }
            for (unsigned int k = 0; k < 3; k++)
            {
                matrix[elems[j]*elementsNum + elems[k]] = res;
            }
        }
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;

    delete[] elements;
    delete[] matrix;
    delete[] mutexes;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const u32 steps = 5;
    for (u32 i = 0; i < steps; i++)
    {
        sum += RealAppTest();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
You're combining two different sets of threading/synchronization primitives: OpenMP, which is built into the compiler and has a runtime system, and a manually created POSIX mutex via std::mutex. It's probably not surprising that there are some interoperability hiccups with some compiler/OS combinations.
My guess here is that in the slow case, the OpenMP runtime is going overboard to make sure that there's no interactions between higher-level ongoing OpenMP threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.
For mutex-like behaviour in the OpenMP framework, we can use critical sections:
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    #pragma omp critical
    sum += c[i];
}
or explicit locks:
omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    omp_set_lock(&sumlock);
    sum += c[i];
    omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);
We get much more reasonable timings:
$ time ./openmp-original
real 1m41.119s
user 1m15.961s
sys 1m53.919s
$ time ./openmp-critical
real 0m16.470s
user 1m2.313s
sys 0m0.599s
$ time ./openmp-locks
real 0m15.819s
user 1m0.820s
sys 0m0.276s
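As an aside (not in the original answer): since the only shared update in this particular loop is the accumulation into sum, a reduction removes the lock from the hot loop entirely. A minimal sketch of the loop inside TestOpenMP:

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < len; i++)
{
    c[i] = k*a[i] + b[i] + k;
    if (c[i] > 0.0)
        c[i] += ActionWithNumber(c[i]);
    else
        c[i] -= ActionWithNumber(c[i]);
    sum += c[i]; // accumulated into a per-thread copy, combined once at the end
}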
Updated: There's no problem with using an array of OpenMP locks in exactly the same way as the mutexes:
omp_lock_t sumlocks[elementsNum];
for (unsigned idx = 0; idx < elementsNum; idx++)
    omp_init_lock(&(sumlocks[idx]));

//...

#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
    unsigned int* elems = &elements[3*i];
    for (unsigned int j = 0; j < 3; j++)
    {
        // in here set mutex for row with index = elems[j];
        double res = 0.0;
        for (unsigned int k = 0; k < 3; k++)
        {
            res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
        }
        omp_set_lock(&(sumlocks[i]));
        for (unsigned int k = 0; k < 3; k++)
        {
            matrix[elems[j]*elementsNum + elems[k]] = res;
        }
        omp_unset_lock(&(sumlocks[i]));
    }
}
for (unsigned idx = 0; idx < elementsNum; idx++)
    omp_destroy_lock(&(sumlocks[idx]));
I'm trying to learn OpenMP, but the professor moved on to a different subject and I feel like I haven't learned (or understood) a whole lot.
After looking at some solved questions here on SO I wrote this bit of code:
Working code now looks like this:
void many_iterations()
{
    int it, i, j;
    for (it = 0; it < NUM_ITERATIONS; it++)
    {
        #pragma omp parallel
        {
            #pragma omp for private(j)
            for (i = 0; i < N; i++)
                for (j = 0; j < M; j++)
                {
                    if (i == j) B[i][j] = A[i][j] * 2;
                    else B[i][j] = A[i][j] * 3;
                }
        }
        int **aux = A;
        A = B; B = aux;
    }
}
I also wrote a serial version (without the #pragma omp bits) and noticed that this version does not actually work properly (the output A differs between the serial version and this one). I then managed to change the two inner for loops to this working bit (correct output as far as I can tell):
for (index = 0; index < N * M; index++)
{
    int i = index / M, j = index % M;
    // rest of code here
}
This one does work, but I ran into a problem: running on two threads, it is only as fast as the serial version (with the two inner fors), and when I tried running it with only one thread the execution time was a lot slower.
Reading online, I understood that the parallel section should somehow start before the main for loop so that the overhead is reduced, but again, my output (A) is wrong.
So my issues are:
How do I set #pragma omp parallel before the first for without ruining the code?
Why is the serial version equal to the 2-thread version of the code with collapsed for loops?
How should I make the code actually more efficient when running on multiple threads?
As a side note, I tried running the serial version with collapsed for loops and I got it to run a lot slower (just like the "parallel" version with 1 thread).
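For reference (connecting to the collapse discussion earlier in this document), the manual index flattening can also be expressed with the collapse clause, since the two loops are perfectly nested. A minimal sketch of just the inner part:

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
    {
        if (i == j) B[i][j] = A[i][j] * 2;
        else B[i][j] = A[i][j] * 3;
    }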
Edit: Trying to use #pragma omp parallel before the it loop:
void many_iterations()
{
    int it, i, j;
    #pragma omp parallel
    {
        for (it = 0; it < NUM_ITERATIONS; it++)
        {
            #pragma omp for private(j)
            for (i = 0; i < N; i++)
                for (j = 0; j < M; j++)
                {
                    if (i == j) B[i][j] = A[i][j] * 2;
                    else B[i][j] = A[i][j] * 3;
                }
            #pragma omp single
            {
                int **aux = A;
                A = B; B = aux;
            }
        }
    }
}
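One likely issue with the edited version is that it is declared outside the parallel region and is therefore shared, so all threads race on the same outer-loop counter. A minimal sketch that keeps the structure but declares the counters inside the region, so every thread walks its own it loop (the barriers implied by for and single keep the iterations in step):

void many_iterations()
{
    #pragma omp parallel
    {
        // declared inside the parallel region, so each thread has its own copies
        int it, i, j;
        for (it = 0; it < NUM_ITERATIONS; it++)
        {
            #pragma omp for
            for (i = 0; i < N; i++)
                for (j = 0; j < M; j++)
                {
                    if (i == j) B[i][j] = A[i][j] * 2;
                    else B[i][j] = A[i][j] * 3;
                }
            // only one thread swaps the global pointers; the implicit barrier
            // at the end of single keeps the other threads in step
            #pragma omp single
            {
                int **aux = A;
                A = B; B = aux;
            }
        }
    }
}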