Hi, I am reading this website http://www.viva64.com/en/a/0054/ and for point number 17 it says that the code below is wrong without the barrier. Why? I read at http://bisqwit.iki.fi/story/howto/openmp/#BarrierDirectiveAndTheNowaitClause that there is an implicit barrier at the end of each parallel block, and at the end of each sections, for, and single construct, unless the nowait clause is used.
struct MyType
{
    ~MyType();
};

MyType threaded_var;
#pragma omp threadprivate(threaded_var)

int main()
{
    #pragma omp parallel
    {
        ...
        #pragma omp barrier // code is wrong without barrier.
    }
}
Could someone explain this to me, please? Thanks.
The linked web page is wrong about that point. There actually is an implicit barrier at the end of the parallel region.
Since the website seems to have a Windows focus and Microsoft only supports OpenMP 2.0, it might be worth noting that this implicit barrier exists not only in the current standard (4.5) but also in version 2.0:
Upon completion of the parallel construct, the threads in the team
synchronize at an implicit barrier, [...]
http://www.openmp.org/mp-documents/cspec20.pdf
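To make the point concrete, here is a minimal sketch (with a hypothetical do_work function, not from the question) showing that the implicit barrier at the end of the parallel region is already enough: the serial code after the region cannot start until every thread has finished its work, so an extra explicit barrier as the last statement of the region adds nothing.
#include <omp.h>
#include <stdio.h>

void do_work(int tid) { (void)tid; /* ... per-thread work ... */ }

int main()
{
    #pragma omp parallel
    {
        do_work(omp_get_thread_num());
        // No explicit "#pragma omp barrier" needed here: the parallel
        // construct itself ends with an implicit barrier.
    }
    // All threads are guaranteed to have finished do_work() by this point.
    printf("all threads done\n");
    return 0;
}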
Does OpenMP with target offloading on the GPU include a global memory fence / global barrier, similar to OpenCL's barrier(CLK_GLOBAL_MEM_FENCE)?
I've tried using a barrier inside a teams construct:
#pragma omp target teams
{
    // Some initialization...
    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some work...
    }
    #pragma omp barrier
    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some other work depending on the previous loop
    }
}
However, it seems that the barrier only works within a team, equivalent to:
barrier(CLK_LOCAL_MEM_FENCE);
I would like to avoid splitting the kernel into two, to avoid sending team-local data to global memory just to load it again.
Edit: I've been able to enforce the desired behavior using a global atomic counter and busy-waiting across the teams. However, this doesn't seem like a good solution, and I'm still wondering if there is a better way to do this with proper OpenMP.
A barrier construct only synchronizes threads in the current team. Synchronization between threads from different thread teams launched by a teams construct is not available. OpenMP's execution model doesn't guarantee that such threads will even execute concurrently, so using atomic constructs to synchronize between the threads will not work in general:
Whether the initial threads concurrently execute the teams region is
unspecified, and a program that relies on their concurrent execution for the
purposes of synchronization may deadlock.
Note that the OpenCL barrier call only provides synchronization within a workgroup, even with the CLK_GLOBAL_MEM_FENCE argument. See Barriers in OpenCL for more information on semantics of CLK_GLOBAL_MEM_FENCE versus CLK_LOCAL_MEM_FENCE.
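The portable way to get a device-wide synchronization point is the one the question hoped to avoid: split the work into two target regions, so that the end of the first target construct is the global synchronization point. A rough sketch, assuming a is an array of length N that can be mapped to the device (names are illustrative, not from the question):
// Keep a resident on the device across both phases.
#pragma omp target data map(tofrom: a[0:N])
{
    // Phase 1: the first target region must finish completely
    // before the second one is launched.
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some work writing a[i]...
    }

    // Phase 2: can safely read anything written in phase 1.
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some other work depending on the previous loop...
    }
}
The trade-off is exactly the one mentioned above: any team-local data has to be written out to mapped (global) memory at the end of phase 1 and read back in phase 2.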
Is it possible to control which OpenMP thread is used to execute a particular task?
In other words, say that we have the following three tasks:
#pragma omp parallel
#pragma omp single
{
    #pragma omp task
    block1();
    #pragma omp task
    block2();
    #pragma omp task
    block3();
}
Is it possible to control the set of OpenMP threads that the OpenMP scheduler chooses to execute each of these three tasks? The idea is that if I have used OpenMP's thread affinity mechanism to bind OpenMP threads to particular NUMA nodes, I want to make sure that each task is executed by a core of the appropriate NUMA node. Is this possible in OpenMP 4.5? Is it possible in OpenMP 5.0?
In a certain sense, this can be accomplished using the affinity clause that was introduced in version 5.0 of the OpenMP API. What you can do is this:
float * a = ...
float * b = ...
float * c = ...

#pragma omp parallel
#pragma omp single
{
    #pragma omp task affinity(a)
    block1();
    #pragma omp task affinity(b)
    block2();
    #pragma omp task affinity(c)
    block3();
}
The OpenMP implementation would then determine where the data of a, b, and c has been allocated (that is, in which NUMA domain of the system) and schedule the respective task for execution on a thread in that NUMA domain. Please note that this is a mere hint to the OpenMP implementation, and that it can ignore the affinity clause and still execute the task on a different thread that is not close to the data.
Of course, you will have to use an OpenMP implementation that already supports the affinity clause and does more than simply ignore it.
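If you have to support compilers at different OpenMP levels, one pragmatic option (a sketch, applied to one of the tasks above) is to guard the clause with the _OPENMP version macro, whose value is 201811 for OpenMP 5.0:
#if defined(_OPENMP) && _OPENMP >= 201811   // OpenMP 5.0 or newer
    #pragma omp task affinity(a)
#else
    #pragma omp task                        // fall back to a plain task
#endif
    block1();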
Other than the above, there's no OpenMP conforming way to assign a specific task to a specific worker thread for execution.
I am trying to parallelize my C++ code using OpenMP.
This is my first time with OpenMP and I have a couple of questions about how to use private / shared properly.
Below is just some sample code I wrote to understand what is going on. Correct me if I am wrong.
#pragma omp parallel for
for (int x = 0; x < 100; x++)
{
    for (int y = 0; y < 100; y++)
    {
        for (int z = 0; z < 100; z++)
        {
            a[x][y][z] = U[x] + U[y] + U[z];
        }
    }
}
So by using #pragma omp parallel for I can use multiple threads to run this loop, e.g., with 5 threads, thread #1 handles 0<=x<20, thread #2 handles 20<=x<40, ..., and thread #5 handles 80<=x<100.
All the threads run at the same time, so this makes the code faster.
Since x, y, and z are declared inside the loop, they are private (each thread has its own copy of these variables), while a and U are shared.
So each thread reads a shared variable U and writes to a shared variable a.
I have a couple of questions.
What would be the difference between #pragma omp parallel for and #pragma omp parallel for private(y,z)? I think since x, y, and z are already private, they should be the same.
If I use #pragma omp parallel for private(a, U), does this mean each thread will have a copy of a and U?
For example, with 2 threads that each have a copy of a and U, thread #1 handles 0<=x<50 so it writes from a[0][0][0] to a[49][99][99], and thread #2 writes from a[50][0][0] to a[99][99][99]. And after that, are these two results merged so that we have the complete version of a[x][y][z]?
Any variable declared within a parallel block will be private. Variables mentioned in the private clause of a parallel directive follow the normal rules for variables: the variable must already be declared at the point it is used.
The effect of private is to create a copy of the variable for each thread. The threads can then update their copy without worrying about changes that could be made by other threads. At the end of the parallel block, these values are generally lost unless other clauses are included in the parallel directive. The reduction clause is the most common, as it can combine the results from each thread into a final result for the loop.
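As an illustration (a minimal sketch, not taken from the question), this is what the reduction clause looks like for a simple sum: each thread works on its own private copy of sum, and the copies are combined into the shared variable when the loop ends.
#include <stdio.h>

int main()
{
    double sum = 0.0;
    // Each thread gets a private copy of sum initialized to 0.0; at the end
    // of the construct the private copies are added into the shared sum.
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < 1000; i++)
    {
        sum += i * 0.5;
    }
    printf("sum = %f\n", sum);
    return 0;
}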
Does OpenMP ensure that the contents of a dynamic array are up-to-date and visible to all threads after an OMP barrier?
Yes. A barrier causes all threads' view of all accessible memory to be made consistent; that is, it implicitly flushes the entire state of the program.
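As a sketch of what that guarantee buys you (a hypothetical example, not from the question): one thread can allocate and fill a heap array, and after the barrier every other thread is guaranteed to see both the pointer and the filled contents.
#include <omp.h>
#include <stdlib.h>

int *data = NULL;

void example(int n)
{
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
        {
            // Thread 0 allocates and fills the dynamic array.
            data = (int *) malloc(n * sizeof(int));
            for (int i = 0; i < n; i++)
                data[i] = i;
        }

        // The barrier (and its implicit flush) makes thread 0's writes
        // visible to every other thread before they read data.
        #pragma omp barrier

        int last = data[n - 1];
        (void)last;
    }
}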
If your array is declared outside the #pragma omp parallel construct, it is automatically accessible to and shared by all threads.
But how it is updated by the threads depends only on your algorithm and on the synchronization mechanism you use to ensure correctness.
Is it ok to use omp pragmas like critical, single, master, or barrier outside of an omp parallel block? I have a function that can be called either from an OMP parallel block, or not. If yes, I need to enclose part of the code in a critical section. In other words, is this code fine?
void myfunc()
{
    #pragma omp critical
    { /* code */ }
}

// not inside an omp parallel region
myfunc();

#pragma omp parallel
{
    // inside an omp parallel region
    myfunc();
}
I have found no mention of this in the OpenMP documentation. I guess the code should behave exactly as it does with single-threaded execution, and this is how it works with gcc. I would like to know whether this behavior is portable, or whether it is something that the specification does not define, so anything can be expected.
According to this document:
The DO/for, SECTIONS, SINGLE, MASTER and BARRIER directives bind to the dynamically enclosing PARALLEL, if one exists. If no parallel region is currently being executed, the directives have no effect.
So the answer is that those pragmas can be used outside a parallel region, although I still have not found this stated explicitly in the documentation.
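As a quick sanity check (a sketch, not from the original answer), you can observe the behavior at runtime with omp_in_parallel(): called from serial code, the orphaned critical simply executes as if by a single thread; called from inside a parallel region, the calls are serialized as usual.
#include <omp.h>
#include <stdio.h>

void myfunc()
{
    #pragma omp critical
    {
        printf("in critical, in_parallel = %d\n", omp_in_parallel());
    }
}

int main()
{
    myfunc();            // not inside a parallel region: prints in_parallel = 0

    #pragma omp parallel
    myfunc();            // inside a parallel region: prints in_parallel = 1

    return 0;
}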