Parallel for loop in openmp - c++

I'm trying to parallelize a very simple for loop, but this is my first attempt at using OpenMP in a long time. I'm baffled by the run times. Here is my code:
#include <cmath>
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
int main ()
{
    int n=400000, m=1000;
    double x=0, y=0;
    double s=0;
    vector<double> shifts(n, 0);
    #pragma omp parallel for
    for (int j=0; j<n; j++) {
        double r=0.0;
        for (int i=0; i < m; i++){
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }
    cout << *std::max_element( shifts.begin(), shifts.end() ) << endl;
}
I compile it with
g++ -O3 testMP.cc -o testMP -I /opt/boost_1_48_0/include
that is, no "-fopenmp", and I get these timings:
real 0m18.417s
user 0m18.357s
sys 0m0.004s
when I do use "-fopenmp",
g++ -O3 -fopenmp testMP.cc -o testMP -I /opt/boost_1_48_0/include
I get these numbers for the times:
real 0m6.853s
user 0m52.007s
sys 0m0.008s
which doesn't make sense to me. How can using eight cores result in only a 3-fold increase in performance? Am I coding the loop correctly?

You should make use of the OpenMP reduction clause for x and y:
#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
With reduction, each thread accumulates its own partial sums in private copies of x and y, and at the end all partial values are summed together to obtain the final values.
Serial version:
25.05s user 0.01s system 99% cpu 25.059 total
OpenMP version w/ OMP_NUM_THREADS=16:
24.76s user 0.02s system 1590% cpu 1.559 total
See - superlinear speed-up :)

Let's try to understand how to parallelize a simple for loop using OpenMP:
#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
{
    c[i] = a[i] + b[i];
}
Assume that we have 3 available threads. This is what will happen: first, each thread is assigned an independent set of iterations; finally, all threads must wait at the end of the work-sharing construct before continuing.
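For reference, here is a complete, runnable version of that fragment (the array contents and the printout are my own additions, just to make the iteration distribution visible):
#include <omp.h>
#include <cstdio>

int main()
{
    int a[13], b[13], c[13];
    for (int i = 0; i < 13; ++i) { a[i] = i; b[i] = 2 * i; }

    // request 3 threads, then share the loop iterations among them
    #pragma omp parallel num_threads(3)
    #pragma omp for
    for (int i = 1; i < 13; i++)
    {
        c[i] = a[i] + b[i];
        std::printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}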

Because this question is highly viewed, I decided to add a bit of OpenMP background to help those visiting it.
The #pragma omp parallel creates a parallel region with a team of threads, where each thread executes the entire block of code that the parallel region encloses.
From the OpenMP 5.1 one can read a more formal description :
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region (..). The
thread that encountered the parallel construct becomes the primary
thread of the new team, with a thread number of zero for the duration
of the new parallel region. All threads in the new team, including the
primary thread, execute the region. Once the team is created, the
number of threads in the team remains constant for the duration of
that parallel region.
The #pragma omp parallel for creates a parallel region (as described before), and the iterations of the loop that it encloses are assigned to the threads of that region, using the default chunk size and the default schedule, which is typically static. Bear in mind, however, that the default schedule may differ among concrete implementations of the OpenMP standard.
From the OpenMP 5.1 you can read a more formal description :
The worksharing-loop construct specifies that the iterations of one or
more associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks. The iterations are
distributed across threads that already exist in the team that is
executing the parallel region to which the worksharing-loop region
binds.
Moreover,
The parallel loop construct is a shortcut for specifying a parallel
construct containing a loop construct with one or more associated
loops and no other statements.
Or informally, #pragma omp parallel for is a combination of the construct #pragma omp parallel with #pragma omp for. In your case, this would mean that:
#pragma omp parallel for
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
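Or, spelling out the two constructs separately (the loop body is unchanged):
#pragma omp parallel
{
    #pragma omp for
    for (int j = 0; j < n; j++) {
        // ... same loop body as above ...
    }
}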
A team of threads will be created, and to each of those threads will be assigned chunks of the iterations of the outermost loop.
To make it more illustrative: with 4 threads, the #pragma omp parallel for with a chunk_size=1 and a static schedule deals the iterations out round-robin, so thread 0 gets iterations 0, 4, 8, ..., thread 1 gets iterations 1, 5, 9, ..., and so on. Code-wise, the loop would be transformed into something logically similar to:
for(int i = omp_get_thread_num(); i < n; i += omp_get_num_threads())
{
    c[i] = a[i] + b[i];
}
where omp_get_thread_num()
The omp_get_thread_num routine returns the thread number, within the
current team, of the calling thread.
and omp_get_num_threads()
Returns the number of threads in the current team. In a sequential
section of the program omp_get_num_threads returns 1.
Or, in other words: for(int i = THREAD_ID; i < n; i += TOTAL_THREADS), with THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads in the team created for the parallel region.
Armed with this knowledge, and looking at your code, one can see that you have a race condition on the updates of the variables 'x' and 'y'. Those variables are shared among threads and updated inside the parallel region, namely:
x += rand_g1;
y += rand_g2;
To solve this race condition you can use OpenMP's reduction clause:
Specifies that one or more variables that are private to each thread
are the subject of a reduction operation at the end of the parallel
region.
Informally, the reduction clause will create for each thread a private copy of the variables 'x' and 'y', and at the end of the parallel region sum all those private 'x' and 'y' copies into the original 'x' and 'y' variables of the initial thread.
#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
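For intuition, the reduction is roughly equivalent to the following hand-written version (an illustrative sketch of the idea, with my own x_local/y_local names, not what the compiler literally generates):
#pragma omp parallel
{
    double x_local = 0.0, y_local = 0.0;   // private copies, initialized to the identity of '+'
    #pragma omp for
    for (int j = 0; j < n; j++) {
        double r = 0.0;
        for (int i = 0; i < m; i++) {
            double rand_g1 = cos(i/double(m));
            double rand_g2 = sin(i/double(m));
            x_local += rand_g1;
            y_local += rand_g2;
            r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
        }
        shifts[j] = r / m;
    }
    #pragma omp atomic
    x += x_local;                           // combine the partial results into the originals
    #pragma omp atomic
    y += y_local;
}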

What you can achieve at most(!) is a linear speedup.
Now, I don't remember off-hand which of those Linux times is which, but I'd suggest using time.h or (in C++11) <chrono> and measuring the runtime directly from the program. Ideally, wrap the entire code in a loop, run it 10 times, and average to get an approximate runtime.
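A minimal sketch of such a measurement with <chrono> (the timed region is just a placeholder comment here):
#include <chrono>
#include <iostream>

int main()
{
    using clock = std::chrono::steady_clock;

    const int runs = 10;
    double total_ms = 0.0;

    for (int run = 0; run < runs; ++run)
    {
        auto start = clock::now();

        // ... the code you want to time goes here ...

        auto stop = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }

    std::cout << "average runtime: " << total_ms / runs << " ms\n";
}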
Furthermore, in my opinion you have a problem with x and y: they do not adhere to the principle of data locality in parallel programming.

Related

Aspects that affects the efficiency of OpenMP parallelism

I would like to parallelize a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
    #pragma omp simd
    for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
        config[j] = ref_table[i][indices[j]];
    }
    int index = GetIndex(config); // do simple computations on the picked values to get the index
    #pragma omp atomic
    result[index]++;
}
Then I found that I cannot get any improvement in efficiency with 2, 4, or 8 threads: the execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 independent iterations, so I want multiple threads to execute those iterations in parallel.
I guess the reasons for the performance decrease may include: the private copies of config, the random access of ref_table, or the expensive atomic operation. So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?
Private copies of config and random access to ref_table are not problematic; I think the workload is simply very small. There are 2 potential issues which prevent efficient parallelization:
the atomic operation is too expensive;
the overheads are bigger than the workload (which simply means that it is not worth parallelizing with OpenMP).
I do not know which one is more significant in your case, so it is worth trying to get rid of the atomic operation. There are 2 cases:
a) If the result array is zero-initialized you have to use
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config), where N is the size of the result array, and delete #pragma omp atomic. Note that this works on OpenMP 4.5 or later. It is also worth removing #pragma omp simd for a loop of 2-10 iterations. So, your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
    for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
        config[j] = ref_table[i][indices[j]];
    }
    int index = GetIndex(config); // do simple computations on the picked values to get the index
    result[index]++;
}
b) If the result array is not zero-initialized, the solution is very similar: use a temporary zero-initialized array in the loop and add it to the result array afterwards, as sketched below.
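A sketch of that variant (assuming, as in case a), that result holds integer counts, that N is its size, and that OpenMP 4.5 array-section reductions are available):
std::vector<int> tmp(N, 0);         // zero-initialized temporary
int* tmp_p = tmp.data();            // array-section reductions want an array or pointer

#pragma omp parallel for reduction(+:tmp_p[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) {
    for (int j = 0; j < indices.size(); ++j) {
        config[j] = ref_table[i][indices[j]];
    }
    tmp_p[GetIndex(config)]++;
}

for (int k = 0; k < N; ++k)
    result[k] += tmp[k];            // merge into the existing, non-zero result array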
If the speed still does not increase, then your code is not worth parallelizing with OpenMP on your hardware.

OpenMP parallel-for efficiency query

Please consider the following simple code for summing up values in a parallel for loop:
int nMaxThreads = omp_get_max_threads();
int nTotalSum = 0;
#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
    nTotalSum += i;
    cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}
When I run this on a two-core machine, the output I get is
0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5
This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on each loop iteration. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding and then update nTotalSum with this 'local sum' after it has done so.
Is my interpretation of the output correct, and if so, how can I make it more efficient? Note I tried the following:
#pragma omp parallel for num_threads(nMaxThreads) \
    reduction(+:nTotalSum)
int nLocalSum = 0;
for (int i = 0; i < 4; i++)
{
    nLocalSum += i;
}
nTotalSum += nLocalSum;
...but the compiler complained stating that it was expecting a for loop following the pragma omp parallel for statement...
Your output does not, in fact, indicate a critical section during the loop. Each thread has its own zero-initialized copy of nTotalSum: thread 0 works on i = 0,1 and thread 1 on i = 2,3. At the end, OpenMP takes care of adding the local copies back to the original.
You should not try to implement it yourself unless you have specific evidence that you can do it more efficiently. See for example this question / answer.
Your manual version would work if you split the parallel / for into two directives:
int nTotalSum = 0;
#pragma omp parallel
{
    // Declare the local variable here!
    // Then it's implicitly private and properly initialized
    int localSum = 0;
    #pragma omp for
    for (int i = 0; i < 4; i++) {
        localSum += i;
        cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
    }
    // Do not forget the atomic, or it would be a race condition!
    // An alternative would be critical, but that's less efficient
    #pragma omp atomic
    nTotalSum += localSum;
}
I think it's likely that your OpenMP implementation does the reduction just like that.
Each OMP thread has its own copy of nTotalSum. At the end of the OMP section these are combined back into the original nTotalSum. The output you're seeing comes from running loop iterations (0,1) in one thread, and (2,3) in another thread. If you output nTotalSum at the end of your loop, you should see the expected result of 6.
In your nLocalSum example, move the declaration of nLocalSum to before the #pragma omp line. The for loop must be on the line immediately following the pragma.
From my Parallel Programming in OpenMP book:
The reduction clause can be trickier to understand; it has both private and shared storage behavior. The reduction attribute is used on objects that are the target of an arithmetic reduction. This can be important in many applications... reduction allows it to be implemented efficiently by the compiler... this is such a common operation that OpenMP has the reduction data-scope clause just to handle it... the most common example is the final summation of temporary local variables at the end of a parallel construct.
Correction to your second example:
int total_sum = 0;       /* do all variable initialization prior to the omp pragma */
#pragma omp parallel for reduction(+:total_sum)
for (int i = 0; i < 4; i++)
{
    total_sum += i;      /* you used nLocalSum here */
}
/* At this point in the code,
   every thread has done its share of the `for` loop with total_sum private to each thread;
   because of the reduction clause, OpenMP then '+'s the per-thread values of `total_sum` together.
   Do not do an explicit nTotalSum += nLocalSum after the omp for loop, it's not needed;
   the reduction clause takes care of this. */
In your first example, I'm not sure what num_threads(nMaxThreads) is doing in #pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum). But I suspect the weird output might be caused by print buffering.
In any case, the reduction clause is very useful and very efficient if used properly. It would be more obvious in a more complicated, real-world example.
Your posted example is so simple that it doesn't show off the usefulness of the reduction clause. Strictly speaking, for your example, since all threads are just doing a summation, you could also make total_sum a shared variable in the parallel section and have all threads add into it, and the answer would still be correct, provided the update is done inside a critical directive.

omp parallel for : How to make threads write to private arrays and merge all the arrays once all the threads finished processing

I have a requirement to calculate z values, push them into arrays B and s2.
I tried to parallelize the processing using omp parallel for.
One problem I see is: if I don't put the B[i][j] += z and s2[i] += z statements in a critical section, I see a lot of NaN values being generated.
Just wondering if there is a way to write the z values to separate arrays (one array per thread) and merge them at the end.
Any help is greatly appreciated.
#pragma omp parallel
{
    double z;
    #pragma omp parallel for
    for(int t=1; t<n; t++) {
        double phi_i[N];
        double obs_j_seq_t[N];
        for(int i=0; i<N; i++) {
            for(int j=0; j<N; j++) {
                z=phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
                #pragma omp critical
                {
                    B[i][j] += z;
                    s2[i] += z;
                }
            }
        }
    }
}
Your code exposes a few issues, each being a potential killer for its performance and / or validity:
You start by using a #pragma omp parallel and then you add a #pragma omp parallel for. That means that you are trying to generate nested parallelism (a parallel region within another parallel region). This is first, a bad idea and second, disabled by default. Therefore, your second parallel directive is ignored and the work on your loop never gets distributed and is executed in full by all the threads you spawned with your initial parallel directive. Therefore, you have race conditions on the writing of the results in B and s2 by all the threads at once. You solve the issue by adding a critical section, but fundamentally, the code is wrong.
Even if you hadn't had this initial parallel directive or with nested parallelism enabled, your code would have been wrong for the following reasons:
Your z variable is shared across the threads of the second parallel region and since it is modified by all of them, its value is undefined as soon as more than one thread is spawned in the region.
Even more fundamentally, you try to parallelize the loop over t, but the solutions are indexed over i. That means that all threads will compete to update the same indexes, leading once more to race conditions and invalid results. You could again use a critical directive to address that, but that would only make the code super slow. You'd better parallelize the loop over i (while possibly swapping the loops over t and i to make the latter the outermost one).
Your code could become something like this (not tested):
#pragma omp parallel for
for(int i=0; i<N; i++) {
    for(int t=1; t<n; t++) {
        double phi_i[N];        // I guess these need some initialization
        double obs_j_seq_t[N];  // Idem
        for(int j=0; j<N; j++) {
            double z = phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
            B[i][j] += z;
            s2[i] += z;
        }
    }
}
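For completeness, since the question explicitly asks whether each thread can write to its own arrays and merge them at the end: that is also possible while keeping the parallelization over t. A rough sketch (assuming, as above, that B is N x N, s2 has N elements, and that phi_i / obs_j_seq_t get initialized somewhere; std::vector is used for the per-thread buffers):
#pragma omp parallel
{
    std::vector<double> B_local(N * N, 0.0);   // per-thread accumulators
    std::vector<double> s2_local(N, 0.0);

    #pragma omp for
    for (int t = 1; t < n; t++) {
        double phi_i[N];        // initialization omitted, as in the question
        double obs_j_seq_t[N];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                double z = phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
                B_local[i*N + j] += z;
                s2_local[i] += z;
            }
        }
    }

    #pragma omp critical       // one merge per thread instead of one lock per update
    {
        for (int i = 0; i < N; i++) {
            s2[i] += s2_local[i];
            for (int j = 0; j < N; j++)
                B[i][j] += B_local[i*N + j];
        }
    }
}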

Multithreaded Program for Sparse Matrices

I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix. In my code I call vector-vector dot product and matrix-vector product as subroutines many times to arrive at the final solution. I am trying to parallelise the code using OpenMP (especially the above two subroutines).
I also have sequential code in between which I do not intend to parallelise.
My question is: how do I handle the threads created when the subroutines are called? Should I put a barrier at the end of every subroutine call?
Also, where should I set the number of threads?
Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
    rm[i] = b[i] - rm[i];
#pragma omp barrier
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
    xm[i] = x0[i];
#pragma omp barrier
double* pm = (double*) malloc(numcols*sizeof(double));
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
    pm[i] = rm[i];
#pragma omp barrier
scalarProd(rm,rm,numcols);
Thanks
EDIT:
for the scalar dotproduct, I am using the following piece of code:
double scalarProd(double* vec1, double* vec2, int n){
    double prod = 0.0;
    int chunk = 10;
    int i;
    //double* c = (double*) malloc(n*sizeof(double));
    omp_set_num_threads(4);
    // #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
    #pragma omp parallel
    {
        double pprod = 0.0;
        #pragma omp for
        for(i=0;i<n;i++) {
            pprod += vec1[i]*vec2[i];
        }
        //#pragma omp for reduction (+:prod)
        #pragma omp critical
        for(i=0;i<n;i++) {
            prod += pprod;
        }
    }
    return prod;
}
I have now added the time calculation code in my ConjugateGradient function as below:
start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);
Observed results: time taken for the dot product: sequential version 0.000007 s, parallel version 0.002110 s.
I am doing a simple compile using gcc -fopenmp command on Linux OS on my Intel I7 laptop.
I am currently using a matrix of size n = 5000.
I am getting a huge slowdown overall, since the same dot product gets called many times until convergence is achieved (around 80k times).
Please suggest some improvements. Any help is much appreciated!
Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallels you are using. Every time you try and split up the work among your threads, there is an OpenMP overhead. Try and avoid this whenever possible.
So in your case at the very least I would try:
Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region
// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
    rm[i] = b[i] - rm[i]; // (1)
    xm[i] = x0[i];        // (2) does not require (1)
    pm[i] = rm[i];        // (3) requires (1) at this i, not (2)
}
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel
scalarProd(rm,rm,numcols);
Notice how I show that no barriers are actually necessary between your loops anyway.
If the majority of your time had been spent in this computation stage, you will surely be seeing considerable improvement. However, I'm reasonably confident that the majority of your time is being spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll be saving is probably minimal.
** EDIT **
And as per your edit, I am seeing a few problems. (1) Always compile with -O3 when you are testing the performance of your algorithm. (2) You won't be able to improve the runtime of something that takes 0.000007 sec to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is inherently a sequential algorithm, but there are certainly research papers detailing parallel CG. (3) Your implementation of the scalar product is not optimal. Indeed, I suspect your implementation of the matrix-vector product is not either. I would personally do the following:
double scalarProd(double* vec1, double* vec2, int n) {
    double prod = 0.0;
    int i;
    // omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
    #pragma omp parallel for private(i) reduction(+:prod)
    for (i = 0; i < n; ++i) {
        prod += vec1[i]*vec2[i];
    }
    return prod;
}
(4) There are entire libraries (BLAS, LAPACK, etc.) that provide highly optimized matrix-vector, vector-vector, etc. operations; any linear algebra code should be built on top of them. Therefore, I'd suggest using one of those libraries for your two operations before reinventing the wheel and implementing your own.
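For instance, the dot product maps directly onto the BLAS level-1 routine ddot; through the CBLAS interface a drop-in replacement could look roughly like this (link against a BLAS implementation such as OpenBLAS, ATLAS, or MKL):
#include <cblas.h>

double scalarProd(double* vec1, double* vec2, int n) {
    // cblas_ddot computes sum over i of vec1[i]*vec2[i]; the 1s are the strides
    return cblas_ddot(n, vec1, 1, vec2, 1);
}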

OpenMP - create threads only once

I am trying to write a simple application using OpenMP. Unfortunately I have a problem with speedup.
In this application I have one while loop. The body of this loop consists of some instructions which should be done sequentially and one for loop. I use #pragma omp parallel for to make this for loop parallel. This loop doesn't have much work, but is called very often.
I prepared two versions of the for loop, and ran the application on 1, 2, and 4 cores.
version 1 (4 iterations in the for loop): 22 sec, 23 sec, 26 sec.
version 2 (100000 iterations in the for loop): 20 sec, 10 sec, 6 sec.
As you can see, when the for loop doesn't have much work, the time on 2 and 4 cores is higher than on 1 core.
I guess the reason is that #pragma omp parallel for creates new threads in each iteration of the while loop. So I would like to ask you: is there any way to create the threads once (before the while loop), and ensure that some job in the while loop will still be done sequentially?
#include <omp.h>
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char* argv[])
{
    double sum = 0;
    while (true)
    {
        // ...
        // some work which should be done sequentially
        // ...
        #pragma omp parallel for num_threads(atoi(argv[1])) reduction(+:sum)
        for(int j=0; j<4; ++j) // version 2: for(int j=0; j<100000; ++j)
        {
            double x = pow(j, 3.0);
            x = sqrt(x);
            x = sin(x);
            x = cos(x);
            x = tan(x);
            sum += x;

            double y = pow(j, 3.0);
            y = sqrt(y);
            y = sin(y);
            y = cos(y);
            y = tan(y);
            sum += y;

            double z = pow(j, 3.0);
            z = sqrt(z);
            z = sin(z);
            z = cos(z);
            z = tan(z);
            sum += z;
        }
        if (sum > 100000000)
        {
            break;
        }
    }
    return 0;
}
Most OpenMP implementations create a number of threads on program startup and keep them for the duration of the program. That is, most implementations don't dynamically create and destroy threads during execution; to do so would hit performance with severe thread management costs. This approach to thread management is consistent with, and appropriate for, the usual use cases for OpenMP.
It is far more likely that the slowdown you see when you increase the number of OpenMP threads is down to imposing a parallel overhead on a loop with a tiny number of iterations. Hristo's answer covers this.
You could move the parallel region outside of the while (true) loop and use the single directive to make the serial part of the code execute in one thread only. This will remove the overhead of the fork/join model. Also, OpenMP is not really useful on tight loops with a very small number of iterations (like your version 1). You are basically measuring the OpenMP overhead, since the work inside the loop is done really fast - even 100000 iterations with transcendental functions take less than a second on a current-generation CPU (at 2 GHz and roughly 100 cycles per FP instruction other than addition, it'll take ~100 ms).
That's why OpenMP provides the if(condition) clause that can be used to selectively turn off the parallelisation for small loops:
#pragma omp parallel for ... if(loopcnt > 10000)
for (i = 0; i < loopcnt; i++)
    ...
It is also advisable to use schedule(static) for regular loops (that is for loops in which every iteration takes about the same time to compute).
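Putting those pieces together, here is a minimal sketch of the restructuring (thread count taken from argv[1] as in the question; the sequential work is reduced to a placeholder inside a single construct, and only the x part of the loop body is kept):
#include <omp.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    double sum = 0;

    #pragma omp parallel num_threads(atoi(argv[1]))
    {
        while (true)
        {
            #pragma omp single
            {
                // work that should be done sequentially goes here;
                // one thread executes it while the others wait at the
                // implicit barrier at the end of the single construct
            }

            #pragma omp for reduction(+:sum)
            for (int j = 0; j < 100000; ++j)
            {
                double x = pow(j, 3.0);
                x = sqrt(x);
                x = sin(x);
                x = cos(x);
                x = tan(x);
                sum += x;   // the y and z blocks from the question would go here as well
            }
            // implicit barrier at the end of the for: the reduced sum is now
            // visible to every thread, so they all take the same branch below
            if (sum > 100000000)
                break;
        }
    }
    return 0;
}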