I am trying to parallelize my C++ code using OpenMP.
So this is my first time with OpenMP and I have a couple of questions about how to use private / shared properly
Below is just a sample code I wrote to understand what is going on. Correct me if I am wrong.
#pragma omp parallel for
for (int x=0;x<100;x++)
{
for (int y=0;y<100;y++)
{
for (int z=0;z<100;z++)
{
a[x][y][z]=U[x]+U[y]+U[z];
}
}
}
So by using #pragma omp parallel for I can use multiple threads to do this loop i.e with 5 threads, #1 thread use 0<=x<20, #2 thread use 20<=x<40 ... 80 <=x<100.
And each thread runs at the same time. So by using this, I can make this code faster.
Since x, y, and z are declared inside the loop, they are private (each thread will have a copy version of these variables), a and U are shared.
So each thread reads a shared variable U and writes to a shared variable a.
I have a couple of questions.
What would be the difference between #pragma omp parallel for and #pragma omp parallel for private(y,z)? I think since x, y, and z are already private, they should be the same.
If I use #pragma omp parallel for private(a, U), does this mean each thread will have a copy of a and U?
For example, with 2 threads that have a copy of a and U, thread #1 use 0<=x<50 so that it writes from a[0][0][0] to a[49][99][99] and thread #2 writes from a[50][0][0] to a[99][99][99]. And after that they merge these two results so that they have complete version of a[x][y][z]?
Any variable declared within a parallel block will be private. Variables mentioned in the private clause of a parallel directive follow the normal rules for variables: the variable must already be declared at the point it is used.
The effect of private is to create a copy of the variable for each thread. Then the threads can update the value without worrying about changes that could be made by other threads. At the end of the parallel block, the values are generally lost unless there are other clauses included in the parallel directive. The reduction directive is the most common, as it can combine the results from each thread into a final result for the loop.
Related
Does OpenMP with target offloading on the GPU include a global memory fence / global barrier, similar to OpenCL?
barrier(CLK_GLOBAL_MEM_FENCE);
I've tried using inside a teams construct
#pragma omp target teams
{
// Some initialization...
#pragma omp distribute parallel for
for (size_t i = 0; i < N; i += 1)
{
// Some work...
}
#pragma omp barrier
#pragma omp distribute parallel for
for (size_t i = 0; i < N; i += 1)
{
// Some other work depending on pervious loop
}
}
However it seams that the barrier only works within a team, equivalent to:
barrier(CLK_LOCAL_MEM_FENCE);
I would like to avoid splitting the kernel into two, to avoid sending team local data to global memory just to load it again.
Edit: I've been able enforce the desired behavior using a global atomic counter and busy waiting of the teams. However this doesn't seem like a good solution, and I'm still wondering if there is a better way to do this using proper OpenMP
A barrier construct only synchronizes threads in the current team. Synchronization between threads from different thread teams launched by a teams construct is not available. OpenMP's execution model doesn't guarantee that such threads will even execute concurrently, so using atomic constructs to synchronize between the threads will not work in general:
Whether the initial threads concurrently execute the teams region is
unspecified, and a program that relies on their concurrent execution for the
purposes of synchronization may deadlock.
Note that the OpenCL barrier call only provides synchronization within a workgroup, even with the CLK_GLOBAL_MEM_FENCE argument. See Barriers in OpenCL for more information on semantics of CLK_GLOBAL_MEM_FENCE versus CLK_LOCAL_MEM_FENCE.
I am using OpenMP successful to parallelize for loops in my c++ code. I tried to
step further and use OpenMP tasks. Unfortunately my code behaves
really strange, so i wrote a minimal example and found a problem.
I would like to define a couple of tasks. Each task should be executed once
by an idle thread.
Unfortunately i can only make all threads execute every task or
only one thread performing all tasks sequentially.
Here is my code which basically runs sequentially:
int main() {
#pragma omp parallel
{
int id, nths;
id = omp_get_thread_num();
#pragma omp single nowait
{
#pragma omp task
cout<<"My id is "<<id<<endl;
#pragma omp task
cout<<"My id is "<<id<<endl;
#pragma omp task
cout<<"My id is "<<id<<endl;
#pragma omp task
cout<<"My id is "<<id<<endl;
}
}
return 0;
}
Only worker 0 shows up and gives his id four times.
I expected to see "My id is 0; My id is 1; my id is 2; my id is 3;
If i delete #pragma omp single i get 16 messages, all threads execute
every single cout.
Is this a problem with my OpenMP setup or did I not get something about
tasks? I am using gcc 6.3.0 on Ubuntu and use -fopenmp flag properly.
Your basic usage of OpenMP tasks (parallel -> single -> task) is correct, you misunderstand the intricacies of data-sharing attributes for variables.
First, you can easily confirm that your tasks are run by different threads by moving omp_get_thread_num() inside the task instead of accessing id.
What happens in your example, id becomes implicitly private within the parallel construct. However, inside the task, it becomes implicitly firstprivate. This means, the task copies the value from the thread that executes the single construct. A more elaborate discussion of a similar issue can be found here.
Note that if you used private within a nested task construct, it would not be the same private variable as the one of the outside parallel construct. Simply said, private does not refer to the thread, but the construct. That's the difference to threadprivate. However, threadprivate is not an attribute to a construct, but it's own directive and only applies to variables with file-scope, namespace-scope or static variables with block-scope.
I am trying to have a parallel region which inside it has first a parallel for, then a function call with a parallel for inside and lastly another parallel for.
A simplified example could be this
#pragma parallel
{
#pragma omp for
for(int i=0;i<1000;i++)
position[i]+=velocity[i];
calculateAccelerationForAll();
#pragma omp for
for(int i=0;i<1000;i++)
velocity[i]+=acceleration[i];
}
calculateAccelerationForAll()
{
#pragma parallel omp for
for(int i=0;i<1000;i++)
for(int j=0;j<1000;j++)
acceleration[i]=docalculation
}
The issue here being that I would want the existing threads to jump over into calculateAccelerationForAll and execute the for loop there, rather than having three separated parallel regions. I could ensure that only the first thread actually calls the function, and have a barrier after the function call, but then only that thread executes the for loop inside the function.
The question is really if my assumption, that putting the first and last loop in their own paralle region and making the function call have its own region as well, is inefficient, is false... or if it is correct, how I can then make one regions thread go through it all the way.
Might add, that if I just took the contents of the function and put it inside the main paralle region, between the two existing loops, then it would not be an issue. The problem (for me at least) is that I have to use a function call and make then run in parallel as well.
It helped typing out the problem, it seems.
The obvious answer is to change the pragma in the function
from #pragma parallel for to #pragma for
That makes that for loop use the existing threads from the existing calling parallel section, and it works perfectly.
Near the start of my c++ application, my main thread uses OMP to parallelize several for loops. After the first parallelized for loop, I see that the threads used remain in existence for the duration of the application, and are reused for subsequent OMP for loops executed from the main thread, using the command (working in CentOS 7):
for i in $(pgrep myApplication); do ps -mo pid,tid,fname,user,psr -p $i;done
Later in my program, I launch a boost thread from the main thread, in which I parallelize a for loop using OMP. At this point, I see an entirely new set of threads are created, which has a decent amount of overhead.
Is it possible to make the OMP parallel for loop within the boost thread reuse the original OMP thread pool created by the main thread?
Edit: Some pseudo code:
myFun(data)
{
// Want to reuse OMP thread pool from main here.
omp parallel for
for(int i = 0; i < N; ++i)
{
// Work on data
}
}
main
{
// Thread pool created here.
omp parallel for
for(int i = 0; i < N; ++i)
{
// do stuff
}
boost::thread myThread(myFun) // Constructor starts thread.
// Do some serial stuff, no OMP.
myThread.join();
}
The interaction of OpenMP with other threading mechanisms is deliberately left out of the specification and is therefore dependent heavily on the implementation. The GNU OpenMP runtime keeps a pointer to the thread pool in TLS and propagates it down the (nested) teams. Threads started via pthread_create (or boost::thread or std::thread) do not inherit the pointer and therefore spawn a fresh pool. It is probably the case with other OpenMP runtimes too.
There is a requirement in the standard that basically forces such behaviour in most implementations. It is about the semantics of the threadprivate variables and how their values are retained across the different parallel regions forked from the same thread (OpenMP standard, 2.15.2 threadprivate Directive):
The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all of the following conditions hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The thread affinity policies used to execute both parallel regions are the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.
If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
This, besides performance, is probably the main reason for using thread pools in OpenMP runtimes.
Now, imagine that two parallel regions forked by two separate threads share the same worker thread pool. A parallel region was forked by the first thread and some threadprivate variables were set. Later a second parallel region is forked by the same thread, where those threadprivate variables are used. But somewhere between the two parallel regions, a parallel region is forked by the second thread and worker threads from the same pool are utilised. Since most implementations keep threadprivate variables in TLS, the above semantics can no longer be asserted. A possible solution would be to add new worker threads to the pool for each separate thread, which is not much different than creating new thread pools.
I'm not aware of any workarounds to make the worker thread pool shared. And if possible, it will not be portable, therefore the main benefit of OpenMP will be lost.
Does OMP ensure that the contents of an dynamic array is up-to-date and is visible to all threads after an OMP barrier?
Yes. A barrier causes all threads' view of all accessible memory to be made consistent; that is, it implicitly flushes the entire state of the program.
if your array is out of the #pragma omp parallel construct, it will automatically accessible & share by all the thread.
But the way he is update by the thread only depend if your algo and the synchro mechanism you use to ensure the correctness.