My code is threaded using OpenMP. I'm now using Intel Inspector to check for data races between the threads, and it reports some inside a critical section.
The relevant code is:
#pragma omp parallel num_threads(current_num_threads) default(none) shared(max_particle_radius_,min_particle_radius_,...)
{
double thread_max_radius = max_particle_radius_;
double thread_min_radius = min_particle_radius_;
#pragma omp for schedule(static)
for (ID i = 0; i<n_particles_; ++i) {
// ...
thread_max_radius = max(thread_max_radius, radius);
thread_min_radius = min(thread_min_radius, radius);
// ...
}
#pragma omp critical (reduce_minmax_radius)
{
max_particle_radius_ = max(max_particle_radius_, thread_max_radius);
min_particle_radius_ = min(min_particle_radius_, thread_min_radius);
}
}
Both max_particle_radius_ and min_particle_radius_ are members of the class this code runs in, so they are effectively shared in the parallel region anyway.
The Intel Inspector then gives me the following error, which I can't really understand:
How can there be a data race, if only one thread at the time can enter the critical section? Am I missing something about how OpenMP works?
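The Inspector output is not shown here, so this is only a guess, but one common source of such a report is that the initial reads of max_particle_radius_ and min_particle_radius_ at the top of the parallel region are not synchronised: a fast thread can already be writing the members inside the critical section while a slower thread is still copying them into its thread-local variables. Below is a minimal sketch that avoids those unsynchronised reads by starting from sentinel values; the free function reduce_minmax and the radii vector are hypothetical stand-ins for the class and for the omitted per-particle work.

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

void reduce_minmax(const std::vector<double> &radii,
                   double &max_particle_radius_, double &min_particle_radius_)
{
    #pragma omp parallel
    {
        // Start from neutral sentinels instead of reading the shared members,
        // so the members are only ever touched inside the named critical section.
        double thread_max_radius = -std::numeric_limits<double>::infinity();
        double thread_min_radius =  std::numeric_limits<double>::infinity();

        #pragma omp for schedule(static)
        for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)radii.size(); ++i) {
            thread_max_radius = std::max(thread_max_radius, radii[i]);
            thread_min_radius = std::min(thread_min_radius, radii[i]);
        }

        #pragma omp critical (reduce_minmax_radius)
        {
            max_particle_radius_ = std::max(max_particle_radius_, thread_max_radius);
            min_particle_radius_ = std::min(min_particle_radius_, thread_min_radius);
        }
    }
}

With OpenMP 3.1 or later the same pattern can also be written with reduction(max: ...) and reduction(min: ...) clauses on the loop, which removes the critical section entirely.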
void x(vector<vector<int>>&A,vector<vector<int>>&B,vector<vector<int>>&C,int n)
{
int i,j;
#pragma omp parallel
#pragma omp single
{
for (i=0;i<n;++i)
for(j=0;j<n;++j)
#pragma omp task
C[i][j] = A[i][j]+B[i][j];
}
}
This code gives a segmentation fault.
I am sure that the parallel directive will create multiple threads, the single region will then be executed by one of them, and a task will be created for each (i, j) entry, so n*n independent tasks will be created and there should be no race condition.
But it gives a segmentation fault. Please help.
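A likely cause worth checking first: i and j are declared before the parallel region, so they are shared inside the tasks; by the time a task actually runs, the loops may already have advanced them (up to n), so the task indexes past the end of the vectors. A minimal sketch of the fix is to give each task its own copy of the indices, either with an explicit firstprivate clause or simply by declaring them inside the for statements:

#include <vector>
using std::vector;

void x(vector<vector<int>> &A, vector<vector<int>> &B, vector<vector<int>> &C, int n)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < n; ++i)        // declared here, i and j are no longer shared,
            for (int j = 0; j < n; ++j)    // so each task captures its own values
                #pragma omp task firstprivate(i, j) shared(A, B, C)
                C[i][j] = A[i][j] + B[i][j];
    }
}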
I am trying to do a distributed search using omp.h. I am creating 4 threads. The thread with id 0 does not perform the search; instead it oversees which thread has found the number in the array. Below is my code:
int arr[15]; //This array is randomly populated
int process=0,i=0,size=15; bool found=false;
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
#pragma omp cancellation point parallel
if(thread_id==0){
while(found==false){ continue; }
if(found==true){
cout<<"Number found by thread: "<<process<<endl;
#pragma omp cancel parallel
}
}
else{
#pragma omp parallel for schedule(static,5)
for(i=0;i<size;i++){
if(arr[i]==number){ //number is a int variable and its value is taken from user
found = true;
process = thread_id;
}
cout<<i<<endl;
}
}
}
The problem I am having is that each thread executes the for loop from i=0 to i=14. According to my understanding, OpenMP divides the iterations of the loop among the threads, but this is not happening here. Can anyone tell me why, and a possible solution?
Your problem is that you have a parallel inside a parallel. That means that each thread from the first parallel region makes a new team. That is called nested parallelism and it is allowed, but by default it's turned off. So each thread creates a team of 1 thread, which then executes its part of the for loop, which is the whole loop.
So your omp parallel for should be omp for.
But now there is another problem: your loop is going to be distributed over all threads, except that thread zero never gets to the loop. So you get deadlock.
.... and the actual solution to your problem is a lot more complicated. It involves creating two tasks, one that spins on the shared variable, and one that does the parallel search.
#pragma omp parallel
{
# pragma omp single
{
int p = omp_get_num_threads();
int found = 0;
# pragma omp taskgroup
{
/*
* Task 1 listens to the shared variable
*/
# pragma omp task shared(found)
{
while (!found) {
if (omp_get_thread_num()<0) printf("spin\n");
continue; }
printf("found!\n");
# pragma omp cancel taskgroup
} // end 1st task
/*
* Task 2 does something in parallel,
* sets `found' to true if found
*/
# pragma omp task shared(found)
{
# pragma omp parallel num_threads(p-1)
# pragma omp for
for (int i=0; i<p; i++)
// silly test
if (omp_get_thread_num()==2) {
printf("two!\n");
found = 1;
}
} // end 2nd task
} // end taskgroup
}
}
(Do you note the printf that is never executed? I needed that to prevent the compiler from "optimizing away" the empty while loop.)
Bonus solution:
#pragma omp parallel num_threads(4)
{
  if(omp_get_thread_num()==0){ spin_on_found; }
  if(omp_get_thread_num()!=0){
    #pragma omp for nowait schedule(dynamic)
    for ( loop ) stuff
  }
}
The combination of dynamic and nowait can somehow deal with the missing thread.
@Victor Eijkhout already explained what happened here; I just want to show you a simpler (and data-race-free) solution.
Note that OpenMP has significant overhead; in your case the overhead is bigger than the gain from parallelization, so the best idea is not to parallelize this search at all.
If you do some expensive work inside the loop, the simplest solution is to skip that expensive work whenever it is no longer necessary. Note that I have used #pragma omp critical before found = true; to avoid a data race.
#pragma omp parallel for
for(int i=0; i<size;i++){
if(found) continue;
// some expensive work here
if(CONDITION){
#pragma omp critical
found = true;
}
}
Another alternative is to use #pragma omp cancel for:
#pragma omp parallel
#pragma omp for
for(int i=0; i<size;i++){
#pragma omp cancellation point for
// some expensive work here
if(CONDITION){
//cancelling the for loop
#pragma omp cancel for
}
}
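One practical detail that applies to this #pragma omp cancel for variant as well as to the cancel parallel in the question: cancellation is disabled by default and has to be switched on through the OMP_CANCELLATION environment variable, otherwise the cancel and cancellation point directives are simply ignored. omp_get_cancellation() reports whether it is active:

#include <cstdio>
#include <omp.h>

int main()
{
    // Run as:  OMP_CANCELLATION=true ./a.out
    std::printf("cancellation enabled: %d\n", omp_get_cancellation());
    return 0;
}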
In the following code, I have created a parallel region using #pragma omp parallel.
Within the parallel region, there is a section of code that needs to be executed by only one thread, which is achieved using #pragma omp single nowait.
Inside the sequential region there is a for loop that can be parallelized, and I am using #pragma omp taskloop to achieve that.
After the loop is done, I have used #pragma omp taskwait to make sure that the rest of the code is executed by only one thread. However, it is not behaving as I expect: multiple threads are accessing the section of code after the #pragma omp taskwait, which is declared inside the region defined by #pragma omp single nowait.
std::vector<std::unordered_map<int, int>> vec_ht(n_comp + 1);
vec_ht[0].insert({root_comp_id, root_comp_node});
#pragma omp parallel
{
#pragma omp single nowait
{
int nthreads = omp_get_num_threads();
for (int l = 0; l < n_comp; ++l) {
int bucket_count = vec_ht[l].bucket_count();
#pragma omp taskloop
for (int bucket_id = 0; bucket_id < bucket_count; ++bucket_id) {
if (vec_ht[l].bucket_size(bucket_id) == 0) { continue; }
int thread_id = omp_get_thread_num();
for (auto it_vec_ht = vec_ht[l].begin(bucket_id); it_vec_ht != vec_ht[l].end(bucket_id); ++it_vec_ht) {
// some operation --code removed for minimality
} // for it_vec_ht[l]
} // for bucket_id taskloop
#pragma omp taskwait
// Expected that henceforth all code will be accessed by one thread only
for (int tid = 0; tid < nthreads; ++tid) {
// some operation --code removed for minimality
} // for tid
} // for l
} // pragma omp single nowait
} // pragma parallel
It doesn't look like you necessarily need to use the enclosing parallel/single/taskloop layout. If you aren't going to specify the number of threads, your system should default to using the maximum number of threads available. You can get this value outside of an OpenMP construct using omp_get_max_threads(). Then you can use just the taskloop structure, or replace it with a #pragma omp parallel for.
I think the issue with your code is the #pragma omp taskwait line. The single thread should fork into many threads when it hits the taskloop construct, and then collapse back to a single thread afterwards. I think you might be triggering a brand new forking of your single thread with the #pragma omp taskwait command. An alternative to #pragma omp taskwait that definitely doesn't trigger thread forking is #pragma omp barrier. I think making this substitution will make your code work in its current form.
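A minimal sketch of the restructuring suggested in the first paragraph, reusing the names from the question (the bodies removed for minimality are left as comments, and the per-level parallel for is only one way to split the buckets):

#include <omp.h>
#include <unordered_map>
#include <vector>

void process_levels(std::vector<std::unordered_map<int, int>> &vec_ht, int n_comp)
{
    int nthreads = omp_get_max_threads();   // queried outside any OpenMP construct

    for (int l = 0; l < n_comp; ++l) {
        int bucket_count = (int)vec_ht[l].bucket_count();

        #pragma omp parallel for
        for (int bucket_id = 0; bucket_id < bucket_count; ++bucket_id) {
            if (vec_ht[l].bucket_size(bucket_id) == 0) { continue; }
            for (auto it = vec_ht[l].begin(bucket_id); it != vec_ht[l].end(bucket_id); ++it) {
                // per-bucket operation (removed for minimality in the question)
            }
        }
        // Only the encountering thread continues here, so this loop is serial by construction.
        for (int tid = 0; tid < nthreads; ++tid) {
            // per-thread operation (removed for minimality in the question)
        }
    }
}

Starting a new parallel region for every level does add some start-up overhead, so whether this beats the taskloop version depends on n_comp and the amount of work per bucket.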
I'm sitting with some code here trying to make orphaning work and to reduce the overhead by reducing the number of #pragma omp parallel calls.
What I'm trying is something like:
#pragma omp parallel default(none) shared(mat,mat2,f,max_iter,tol,N,conv) private(diff,k)
{
#pragma omp master // I'm not against using #pragma omp single or whatever will work
{
while(diff>tol) {
do_work(mat,mat2,f,N);
swap(mat,mat2);
if( !(k%100) ) // Only test stop criteria every 100 iteration
diff = conv[k] = do_more_work(mat,mat2);
k++;
} // end while
} // end master
} // end parallel
do_work depends on the previous iteration, so the while loop has to be run sequentially.
But I would like to be able to run do_work in parallel, so it would look something like:
void do_work(double *mat, double *mat2, double *f, int N)
{
int i,j;
double scale = 1/4.0;
#pragma omp for schedule(runtime) // Just so I can test different settings without having to recompile
for(i=0;i<N;i++)
for(j=0;j<N;j++)
mat[i*N+j] = scale*(mat2[(i+1)*N+j]+mat2[(i-1)*N+j] + ... + f[i*N+j]);
}
I hope this can be accomplished in some way; I'm just not sure how, so any help I can get is greatly appreciated (even if you're telling me this isn't possible). By the way, I'm working with OpenMP 3.0, the GCC compiler and the Sun Studio compiler.
The outer parallel region in your original code contains only a serial piece (#pragma omp master), which makes no sense and effectively results in purely serial execution (no parallelism). Since do_work() depends on the previous iteration, but you want to run it in parallel, you must use synchronisation. The OpenMP tool for that is an (explicit or implicit) synchronisation barrier.
For example (code similar to yours):
#pragma omp parallel
for(int j=0; diff>tol; ++j) // must be the same condition for each thread!
#pragma omp for // note: implicit synchronisation after for loop
for(int i=0; i<N; ++i)
work(j,i);
Note that the implicit synchronisation ensures that no thread enters the next j if any thread is still working on the current j.
The alternative
for(int j=0; diff>tol; ++j)
#pragma omp parallel for
for(int i=0; i<N; ++i)
work(j,i);
should be less efficient, as it creates a new team of threads at each iteration, instead of merely synchronising.
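For completeness, here is a sketch of how the same idea maps onto the code from the question: diff and k must be shared here (unlike the private(diff, k) clause in the original region), and do_more_work() is assumed to contain no worksharing of its own.

void do_work(double *mat, double *mat2, double *f, int N);   // contains the orphaned "#pragma omp for"
double do_more_work(double *mat, double *mat2);

void solve(double *mat, double *mat2, double *f, double *conv, int N, double tol)
{
    double diff = tol + 1.0;   // force at least one iteration; adjust to taste
    int k = 0;

    #pragma omp parallel default(none) shared(mat, mat2, f, conv, N, tol, diff, k)
    {
        while (diff > tol) {              // the same shared condition for every thread
            do_work(mat, mat2, f, N);     // orphaned omp for; its implicit barrier keeps the team in step
            #pragma omp single
            {
                double *tmp = mat; mat = mat2; mat2 = tmp;   // swap(mat, mat2)
                if (!(k % 100))           // only test the stop criterion every 100 iterations
                    diff = conv[k] = do_more_work(mat, mat2);
                ++k;
            }   // implicit barrier: the updated diff is visible before the next test
        }
    }
}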
Why won't the Intel compiler let me specify that some actions in an OpenMP parallel for block should be executed by the master thread only?
And how can I do what I'm trying to achieve without this kind of functionality?
What I'm trying to do is update a progress bar through a callback in a parallel for:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
#pragma omp master
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}
I want only the master thread to call the callback, because if I don't enforce that (say by using omp critical instead to ensure only one thread uses the callback at once) I get the following runtime exception:
The application called an interface that was marshalled for a different thread.
...hence the desire to keep all callbacks in the master thread.
Thanks in advance.
#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
Compiler Error C3034
OpenMP 'master' directive cannot be directly nested within 'parallel for' directive
Visual Studio 2010 OpenMP 2.0
Maybe like this:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
//#pragma omp master    // this is an error here
//#pragma omp critical  // this is valid
if (omp_get_thread_num() == 0) // this may be a good way
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}
The reason you get the error is that the master thread isn't there most of the time when the code reaches the #pragma omp master line.
For example, let's take the code from Artyom:
#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
If the code compiled, the following could happen:
Let's say thread 0 starts (the master thread). It reaches the pragma that practically says "Master, do the following piece of code". Being the master, it can run the function.
However, what happens when thread 1 or 2 or 3, etc., reaches that piece of code?
The master directive is telling the present/listening team that the master thread has to execute f(). But the team is a single thread and there is no master present. The program wouldn't know what to do past that point.
And that's why, I think, the master isn't allowed to be inside the for-loop.
Substituting the master directive with if (omp_get_thread_num() == 0) works because now the program says, "If you are master, do this. Otherwise ignore".
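For completeness, here is that substitution applied to the example above; it compiles because master is no longer nested directly inside the parallel for:

#include <omp.h>
void f(){}

int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        // explicit thread-id check instead of "#pragma omp master"
        if (omp_get_thread_num() == 0)
            f();
    }
    return 0;
}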