Using OpenMP in a C++ class

I'm new to OpenMP. I'm trying to use OpenMP in my C++ code. The code is too complicated to post, so I have simplified the question as follows:
class CTet
{
...
void cal_Mn(...);
};
int i, num_tet_phys;
vector<CTet> tet_phys;
num_tet_phys = ...;
tet_phys.resize(num_tet_phys);
#pragma omp parallel private(i)
for (i = 0; i < num_tet_phys; i++)
tet_phys[i].cal_Mn(...);
I hoped that the for loop would run in parallel, but it seems that all threads run the whole loop independently, so the calculation is repeated by every thread. What is the problem in my code, and how do I fix it?
Thank you!
Jun

Try
#pragma omp parallel for private(i)
for (i = 0; i < num_tet_phys; i++)
tet_phys[i].cal_Mn(...);
Note the use of parallel for, and compile with the -fopenmp flag.
The #pragma omp parallel creates a team of threads, all of which execute the next statement (in your case, the entire for loop). After the statement, the threads join back into one.
The #pragma omp parallel for creates a team of threads, which divide the work of the for loop between them.
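The difference can be checked with a minimal sketch. The function names and the per-element visit counter are hypothetical, introduced only for illustration; visit() stands in for tet_phys[i].cal_Mn(...). Both the combined form (parallel for) and the equivalent split form (parallel, then for) should run each iteration exactly once.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for tet_phys[i].cal_Mn(...): records one visit of element i.
static void visit(std::vector<int>& visits, int i) { visits[i] += 1; }

// Combined form: the iterations are divided among the threads of the team.
std::vector<int> run_parallel_for(int n) {
    assert(n >= 0);
    std::vector<int> visits(n, 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        visit(visits, i);
    return visits;
}

// Split form: "parallel" creates the team, "for" divides the loop that follows.
std::vector<int> run_parallel_then_for(int n) {
    assert(n >= 0);
    std::vector<int> visits(n, 0);
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            visit(visits, i);
    }
    return visits;
}
```

If the for directive were dropped from the second function, every thread would execute all n iterations, which is the behaviour described in the question.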

Related

OMP parallel for is not dividing iterations

I am trying to do a distributed search using omp.h. I am creating 4 threads. The thread with id 0 does not perform the search; instead it oversees which thread has found the number in the array. Below is my code:
int arr[15]; //This array is randomly populated
int process=0,i=0,size=15; bool found=false;
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
#pragma omp cancellation point parallel
if(thread_id==0){
while(found==false){ continue; }
if(found==true){
cout<<"Number found by thread: "<<process<<endl;
#pragma omp cancel parallel
}
}
else{
#pragma omp parallel for schedule(static,5)
for(i=0;i<size;i++){
if(arr[i]==number){ //number is a int variable and its value is taken from user
found = true;
process = thread_id;
}
cout<<i<<endl;
}
}
}
The problem I am having is that each thread executes the for loop from i=0 to i=14. According to my understanding, OpenMP divides the iterations of the loop among the threads, but this is not happening here. Can anyone tell me why, and suggest a possible solution?
Your problem is that you have a parallel inside a parallel. That means that each thread from the first parallel region makes a new team. That is called nested parallelism and it is allowed, but by default it's turned off. So each thread creates a team of 1 thread, which then executes its part of the for loop, which is the whole loop.
So your omp parallel for should be omp for.
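A minimal sketch of that effect, with nesting disabled. Rather than rely on the implementation default, the sketch disables it explicitly with omp_set_max_active_levels(1). The function name max_inner_team_size is hypothetical. It reports the largest team size seen inside the nested region, which should be 1.

```cpp
#include <algorithm>
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the largest team size observed inside a parallel region that is
// nested in another parallel region, with nested parallelism disabled.
int max_inner_team_size(int outer_threads) {
    assert(outer_threads >= 1);
    int largest = 0;
#ifdef _OPENMP
    omp_set_max_active_levels(1);  // only the outermost region may be active
    #pragma omp parallel num_threads(outer_threads)
    {
        #pragma omp parallel       // nested region: gets a team of one thread
        {
            #pragma omp critical
            largest = std::max(largest, omp_get_num_threads());
        }
    }
#else
    largest = 1;                   // serial build: there is only one thread
#endif
    return largest;
}
```

Because each inner team has a single thread, a parallel for placed there would hand that one thread every iteration, which matches what the question observes.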
But now there is another problem: your loop is going to be distributed over all threads, except that thread zero never gets to the loop. So you get deadlock.
The actual solution to your problem is a lot more complicated. It involves creating two tasks: one that spins on the shared variable, and one that does the parallel search.
#pragma omp parallel
{
# pragma omp single
{
int p = omp_get_num_threads();
int found = 0;
# pragma omp taskgroup
{
/*
* Task 1 listens to the shared variable
*/
# pragma omp task shared(found)
{
while (!found) {
if (omp_get_thread_num()<0) printf("spin\n");
continue; }
printf("found!\n");
# pragma omp cancel taskgroup
} // end 1st task
/*
* Task 2 does something in parallel,
* sets `found' to true if found
*/
# pragma omp task shared(found)
{
# pragma omp parallel num_threads(p-1)
# pragma omp for
for (int i=0; i<p; i++)
// silly test
if (omp_get_thread_num()==2) {
printf("two!\n");
found = 1;
}
} // end 2nd task
} // end taskgroup
}
}
(Do you note the printf that is never executed? I needed that to prevent the compiler from "optimizing away" the empty while loop.)
Bonus solution:
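#pragma omp parallel num_threads(4)
{
if(omp_get_thread_num()==0){ spin_on_found; }
if(omp_get_thread_num()!=0){
#pragma omp for nowait schedule(dynamic)
for ( loop ) stuff
}
}
The combination of dynamic and nowait can deal with the missing thread: with schedule(dynamic) the iterations are handed out on demand to the threads that actually reach the loop, and nowait removes the barrier at the end of the loop, which thread 0 would never reach. Strictly speaking, OpenMP requires a worksharing loop to be encountered by all threads of the team or by none, so this relies on implementation behaviour.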
@Victor Eijkhout already explained what happened here; I just want to show you a simpler (and data-race-free) solution.
Note that OpenMP has significant overhead; in your case the overhead is bigger than the gain from parallelization, so the best idea is not to parallelize this case at all.
If you do some expensive work inside the loop, the simplest solution is to skip this expensive work when it is no longer necessary. Note that I have used #pragma omp critical before found = true; to avoid a data race.
#pragma omp parallel for
for(int i=0; i<size;i++){
if(found) continue;
// some expensive work here
if(CONDITION){
#pragma omp critical
found = true;
}
}
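Strictly, the unprotected read of found at the top of the loop above can still race with the protected write. A hedged variant is sketched below, using atomic read and atomic write (OpenMP 3.1 or later is assumed) instead of critical. The function name contains and the use of an int flag are choices made for this sketch, not part of the original answer.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Returns true if `number` occurs in `arr`. Iterations that start after the
// flag has been set skip their (potentially expensive) work.
bool contains(const std::vector<int>& arr, int number) {
    assert(arr.size() <= static_cast<std::size_t>(INT_MAX));
    const int size = static_cast<int>(arr.size());
    int found = 0;                      // shared flag, accessed only atomically
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        int already_found;
        #pragma omp atomic read
        already_found = found;
        if (already_found) continue;    // nothing left to do for this iteration
        // some expensive work would go here
        if (arr[i] == number) {
            #pragma omp atomic write
            found = 1;
        }
    }
    return found != 0;
}
```

The loop still visits every index, but once the flag is set the remaining iterations reduce to one atomic read each.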
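Another alternative is to use #pragma omp cancel for. Note that cancellation is disabled by default and has to be enabled at run time by setting the environment variable OMP_CANCELLATION=true.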
#pragma omp parallel
#pragma omp for
for(int i=0; i<size;i++){
#pragma omp cancellation point for
// some expensive work here
if(CONDITION){
//cancelling the for loop
#pragma omp cancel for
}
}

OpenMP taskwait not working

In the following code, I have created a parallel region using #pragma omp parallel.
Within the parallel region, there is a section of code that needs to be executed by only one thread, which is achieved using #pragma omp single nowait.
Inside the sequential region there is a for loop which can be parallelized, and I am using #pragma omp taskloop to achieve it.
After the loop is done, I have used #pragma omp taskwait to make sure that the rest of the code is executed by only one thread. However, it is not behaving as I expect: multiple threads are accessing the section of code after the #pragma omp taskwait, which lies inside the region declared with #pragma omp single nowait.
std::vector<std::unordered_map<int, int>> vec_ht(n_comp + 1);
vec_ht[0].insert({root_comp_id, root_comp_node});
#pragma omp parallel
{
#pragma omp single
{
int nthreads = omp_get_num_threads();
for (int l = 0; l < n_comp; ++l) {
int bucket_count = vec_ht[l].bucket_count();
#pragma omp taskloop
for (int bucket_id = 0; bucket_id < bucket_count; ++bucket_id) {
if (vec_ht[l].bucket_size(bucket_id) == 0) { continue; }
int thread_id = omp_get_thread_num();
for (auto it_vec_ht = vec_ht[l].begin(bucket_id); it_vec_ht != vec_ht[l].end(bucket_id); ++it_vec_ht) {
// some operation --code removed for minimality
} // for it_vec_ht[l]
} // for bucket_id taskloop
#pragma omp taskwait
// Expected that henceforth all code will be accessed by one thread only
for (int tid = 0; tid < nthreads; ++tid) {
// some operation --code removed for minimality
} // for tid
} // for l
} // pragma omp single nowait
} // pragma parallel
It doesn't look like you necessarily need the enclosing parallel/single/taskloop layout. If you aren't going to specify the number of threads, your system should default to the maximum number of threads available. You can get this value outside of an OpenMP construct using omp_get_max_threads(). Then you can use just the taskloop structure, or replace it with a #pragma omp parallel for.
I think the issue with your code is the #pragma omp taskwait line. The single thread should fork into many threads when it hits the taskloop construct, and then collapse back to a single thread afterwards. I think you might be triggering a brand new forking of your single thread with the #pragma omp taskwait command. An alternative to #pragma omp taskwait that definitely doesn't trigger thread forking is #pragma omp barrier. I think making this substitution will make your code work in its current form.
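For comparison, here is a minimal sketch of the parallel/single/taskloop layout from the question. It rests on my understanding of the OpenMP specification (4.5 or later is assumed for taskloop): a taskloop without the nogroup clause behaves as if enclosed in a taskgroup, so the thread that created the tasks continues only after they have finished, and the code that follows inside the single block is executed by that one thread. The function name and workload are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Squares every element using tasks, then sums the results serially.
long sum_of_squares(std::vector<long>& values) {
    const int n = static_cast<int>(values.size());
    long total = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            // One thread creates the tasks; any thread of the team may run them.
            #pragma omp taskloop
            for (int i = 0; i < n; ++i)
                values[i] = values[i] * values[i];

            // The taskloop has completed here (implicit taskgroup), and this
            // part is executed only by the thread inside the single block.
            for (int i = 0; i < n; ++i)
                total += values[i];
        }
    }
    return total;
}
```

Under these assumptions no extra taskwait or barrier is needed between the taskloop and the serial part.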

Multithreaded Program for Sparse Matrices

I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix system. In my code I call vector-vector dot product and matrix-vector product subroutines many times to arrive at the final solution. I am trying to parallelise the code using OpenMP (especially the above two subroutines).
I also have sequential code in between which I do not intend to parallelise.
My question is: how do I handle the threads created when a subroutine is called? Should I put a barrier at the end of every subroutine call?
Also, where should I set the number of threads?
Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
rm[i] = b[i] - rm[i];
#pragma omp barrier
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
xm[i] = x0[i];
#pragma omp barrier
double* pm = (double*) malloc(numcols*sizeof(double));
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
pm[i] = rm[i];
#pragma omp barrier
scalarProd(rm,rm,numcols);
Thanks
EDIT:
for the scalar dotproduct, I am using the following piece of code:
double scalarProd(double* vec1, double* vec2, int n){
double prod = 0.0;
int chunk = 10;
int i;
//double* c = (double*) malloc(n*sizeof(double));
omp_set_num_threads(4);
// #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
#pragma omp parallel
{
double pprod = 0.0;
#pragma omp for
for(i=0;i<n;i++) {
pprod += vec1[i]*vec2[i];
}
//#pragma omp for reduction (+:prod)
#pragma omp critical
for(i=0;i<n;i++) {
prod += pprod;
}
}
return prod;
}
I have now added the time calculation code in my ConjugateGradient function as below:
start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);
Observed results (time taken for the dot product): sequential version 0.000007 s, parallel version 0.002110 s.
I am doing a simple compile using the gcc -fopenmp command on Linux on my Intel i7 laptop.
I am currently using a matrix of size n = 5000.
I am getting a huge slowdown overall, since the same dot product gets called many times until convergence is achieved (around 80k times).
Please suggest some improvements. Any help is much appreciated!
Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallels you are using. Every time you try and split up the work among your threads, there is an OpenMP overhead. Try and avoid this whenever possible.
So in your case at the very least I would try:
Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region
// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
rm[i] = b[i] - rm[i]; // (1)
xm[i] = x0[i]; // (2) does not require (1)
pm[i] = rm[i]; // (3) requires (1) at this i, not (2)
}
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel
scalarProd(rm,rm,numcols);
Notice how I show that no barriers are actually necessary between your loops anyway.
If the majority of your time had been spent in this computation stage, you will surely be seeing considerable improvement. However, I'm reasonably confident that the majority of your time is being spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll be saving is probably minimal.
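To illustrate what "parallelizing at a higher level" can look like, here is a small sketch that performs the residual update and the dot product of the residual with itself inside one parallel region, so the team is created once for both loops. The function name and signature are hypothetical and use std::vector instead of the raw pointers in the question.

```cpp
#include <cassert>
#include <vector>

// Computes r = b - r in place, then returns dot(r, r), in one parallel region.
double update_residual_and_dot(const std::vector<double>& b,
                               std::vector<double>& r) {
    assert(b.size() == r.size());
    const int n = static_cast<int>(r.size());
    double rr = 0.0;
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            r[i] = b[i] - r[i];
        // implicit barrier: all of r is updated before the second loop starts

        #pragma omp for schedule(static) reduction(+:rr)
        for (int i = 0; i < n; i++)
            rr += r[i] * r[i];
        // implicit barrier: the reduction result is complete here
    }
    return rr;
}
```

Whether this pays off depends on n; for vectors as short as the ones in the question, the cost of starting the region may still dominate.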
** EDIT **
And as per your edit, I am seeing a few problems.
(1) Always compile with -O3 when you are testing the performance of your algorithm.
(2) You won't be able to improve the runtime of something that takes 0.000007 s to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is inherently a sequential algorithm, but there are certainly research papers detailing parallel CG.
(3) Your implementation of the scalar product is not optimal. Indeed, I suspect your implementation of the matrix-vector product is not either. I would personally do the following:
double scalarProd(double* vec1, double* vec2, int n) {
double prod = 0.0;
int i;
// omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
#pragma omp parallel for private(i) reduction(+:prod)
for (i = 0; i < n; ++i) {
prod += vec1[i]*vec2[i];
}
return prod;
}
(4) There are entire libraries (LAPACK, BLAS, etc.) that provide highly optimized matrix-vector, vector-vector, etc. operations. Most linear algebra libraries are built upon them. Therefore, I'd suggest using one of those libraries for your two operations before you start reinventing the wheel by implementing your own.

pragma omp for inside pragma omp master or single

I'm sitting with some stuff here trying to make orphaning work, and reduce the overhead by reducing the calls of #pragma omp parallel.
What I'm trying is something like:
#pragma omp parallel default(none) shared(mat,mat2,f,max_iter,tol,N,conv) private(diff,k)
{
#pragma omp master // I'm not against using #pragma omp single or whatever will work
{
while(diff>tol) {
do_work(mat,mat2,f,N);
swap(mat,mat2);
if( !(k%100) ) // Only test stop criteria every 100 iteration
diff = conv[k] = do_more_work(mat,mat2);
k++;
} // end while
} // end master
} // end parallel
do_work depends on the previous iteration, so the while loop has to run sequentially.
But I would like to be able to run do_work in parallel, so it would look something like:
void do_work(double *mat, double *mat2, double *f, int N)
{
int i,j;
double scale = 1/4.0;
#pragma omp for schedule(runtime) // Just so I can test different settings without having to recompile
for(i=0;i<N;i++)
for(j=0;j<N;j++)
mat[i*N+j] = scale*(mat2[(i+1)*N+j]+mat2[(i-1)*N+j] + ... + f[i*N+j]);
}
I hope this can be accomplished in some way; I'm just not sure how. Any help I can get is greatly appreciated (also if you're telling me this isn't possible). By the way, I'm working with OpenMP 3.0, the gcc compiler and the Sun Studio compiler.
The outer parallel region in your original code contains only a serial piece (#pragma omp master), which makes no sense and effectively results in purely serial execution (no parallelism). As do_work() depends on the previous iteration, but you want to run it in parallel, you must use synchronisation. The OpenMP tool for that is an (explicit or implicit) synchronisation barrier.
For example (code similar to yours):
#pragma omp parallel
for(int j=0; diff>tol; ++j) // must be the same condition for each thread!
#pragma omp for // note: implicit synchronisation after for loop
for(int i=0; i<N; ++i)
work(j,i);
Note that the implicit synchronisation ensures that no thread enters the next j if any thread is still working on the current j.
The alternative
for(int j=0; diff>tol; ++j)
#pragma omp parallel for
for(int i=0; i<N; ++i)
work(j,i);
should be less efficient, as it creates a new team of threads at each iteration, instead of merely synchronising.
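A fuller sketch of the first pattern, under several assumptions: it uses a hypothetical 1-D Jacobi smoother instead of the 2-D stencil from the question, and it relies on reduction(max:...), which needs OpenMP 3.1 or later (the question mentions 3.0). The convergence value is written only inside a single block, whose implicit barrier ensures that every thread evaluates the same loop condition.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Repeatedly replaces each interior point by the mean of its neighbours until
// the largest change in a sweep is at most tol. Returns the number of sweeps.
int jacobi_1d(std::vector<double>& u, double tol, int max_iter) {
    assert(u.size() >= 3);
    const int n = static_cast<int>(u.size());
    std::vector<double> next(u);  // the copy keeps the boundary values
    double diff = tol + 1.0;      // shared: read in the loop condition
    double sweep_max = 0.0;       // shared: reduction target of each sweep
    int iter = 0;                 // shared: sweep counter

    #pragma omp parallel          // the team is created once
    {
        while (diff > tol && iter < max_iter) {
            #pragma omp for reduction(max:sweep_max)
            for (int i = 1; i < n - 1; ++i) {
                next[i] = 0.5 * (u[i - 1] + u[i + 1]);
                sweep_max = std::fmax(sweep_max, std::fabs(next[i] - u[i]));
            }
            // implicit barrier: the sweep and its reduction are complete

            #pragma omp single
            {
                std::swap(u, next);
                diff = sweep_max;
                sweep_max = 0.0;
                ++iter;
            }
            // implicit barrier: all threads see the same diff and iter
        }
    }
    return iter;
}
```

Keeping diff separate from the reduction variable means nothing modifies the loop condition while threads are still evaluating it.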

the OpenMP "master" pragma must not be enclosed by the "parallel for" pragma

Why won't the intel compiler let me specify that some actions in an openmp parallel for block should be executed by the master thread only?
And how can I do what I'm trying to achieve without this kind of functionality?
What I'm trying to do is update a progress bar through a callback in a parallel for:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
#pragma omp master
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}
I want only the master thread to call the callback, because if I don't enforce that (say by using omp critical instead to ensure only one thread uses the callback at once) I get the following runtime exception:
The application called an interface that was marshalled for a different thread.
...hence the desire to keep all callbacks in the master thread.
Thanks in advance.
#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
Compiler Error C3034
OpenMP 'master' directive cannot be directly nested within 'parallel for' directive
Visual Studio 2010 OpenMP 2.0
Maybe like this:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
//#pragma omp master    // error: not allowed inside parallel for
//#pragma omp critical  // compiles, but any thread may then run the callback
if (omp_get_thread_num() == 0) // only thread 0 calls the callback
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}
The reason you get the error is that the master thread isn't there most of the time when the code reaches the #pragma omp master line.
For example, let's take the code from Artyom:
#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
If the code did compile, the following could happen:
Let's say thread 0 (the master thread) starts. It reaches the pragma that practically says "Master, do the following piece of code". Being the master, it can run the function.
However, what happens when thread 1, 2, 3, etc. reaches that piece of code?
The master directive tells the present/listening team that the master thread has to execute f(). But the team is a single thread and there is no master present. The program wouldn't know what to do past that point.
And that's why, I think, master isn't allowed inside the for loop.
Substituting the master directive with if (omp_get_thread_num() == 0) works because now the program says, "If you are master, do this. Otherwise ignore".
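A self-contained sketch of that substitution is given below. The function name process_items and the std::function callback are hypothetical stand-ins for the progress-bar callback in the question, and atomic capture assumes OpenMP 3.1 or later. The counter is incremented and read in one atomic step, and only thread 0 invokes the callback.

```cpp
#include <cassert>
#include <functional>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num() { return 0; }  // serial fallback
#endif

// Processes n items in parallel and returns how many were processed.
// `report` is called only from thread 0 with the current item count.
long process_items(long n, const std::function<void(long)>& report) {
    assert(n >= 0);
    long num_items_computed = 0;
    #pragma omp parallel for schedule(guided)
    for (long i = 0; i < n; ++i) {
        long done;
        // increment the shared counter and read its new value atomically
        #pragma omp atomic capture
        done = ++num_items_computed;

        if (omp_get_thread_num() == 0)  // "if you are master, do this"
            report(done);

        // the actual computation would go here
    }
    return num_items_computed;
}
```

Thread 0 reports only during its own iterations, so the progress value can lag behind and the final count is not necessarily reported from inside the loop.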