Does OpenMP with target offloading on the GPU include a global memory fence / global barrier, similar to OpenCL?
barrier(CLK_GLOBAL_MEM_FENCE);
I've tried using a barrier inside a teams construct:
#pragma omp target teams
{
// Some initialization...
#pragma omp distribute parallel for
for (size_t i = 0; i < N; i += 1)
{
// Some work...
}
#pragma omp barrier
#pragma omp distribute parallel for
for (size_t i = 0; i < N; i += 1)
{
// Some other work depending on the previous loop
}
}
However, it seems that the barrier only works within a team, equivalent to:
barrier(CLK_LOCAL_MEM_FENCE);
I would like to avoid splitting the kernel into two, to avoid sending team local data to global memory just to load it again.
Edit: I've been able to enforce the desired behavior using a global atomic counter and busy-waiting in the teams. However, this doesn't seem like a good solution, and I'm still wondering whether there is a better way to do this using proper OpenMP.
A barrier construct only synchronizes threads in the current team. Synchronization between threads from different thread teams launched by a teams construct is not available. OpenMP's execution model doesn't guarantee that such threads will even execute concurrently, so using atomic constructs to synchronize between the threads will not work in general:
Whether the initial threads concurrently execute the teams region is
unspecified, and a program that relies on their concurrent execution for the
purposes of synchronization may deadlock.
Note that the OpenCL barrier call only provides synchronization within a workgroup, even with the CLK_GLOBAL_MEM_FENCE argument. See Barriers in OpenCL for more information on semantics of CLK_GLOBAL_MEM_FENCE versus CLK_LOCAL_MEM_FENCE.
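As a rough sketch of the usual alternative (this is not from the question): the end of a target region is a point where all teams are known to have finished, so the two loops can be launched as two target regions inside one target data region. The arrays then stay in device memory and are not copied back to the host between the phases. This does not keep team-local data alive across the phases; intermediate results still pass through device global memory. The function name two_phase and the arrays a and b are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: two dependent phases over the same arrays.
// Returns the sum of the phase-2 results.
double two_phase(std::size_t n)
{
    std::vector<double> a(n), b(n);
    double* pa = a.data();
    double* pb = b.data();

    // Keep both arrays resident on the device for the whole region:
    // pa is only allocated there, pb is copied back at the end.
    #pragma omp target data map(alloc: pa[0:n]) map(from: pb[0:n])
    {
        // Phase 1: all teams have finished when this target region ends.
        #pragma omp target teams distribute parallel for
        for (std::size_t i = 0; i < n; ++i)
            pa[i] = static_cast<double>(i);

        // Phase 2: may read any element written in phase 1.
        #pragma omp target teams distribute parallel for
        for (std::size_t i = 0; i < n; ++i)
            pb[i] = pa[i] + pa[n - 1 - i];
    }

    double sum = 0.0;
    for (double v : b) sum += v;
    return sum;
}
```

Without offloading support the pragmas fall back to the host (or are ignored), and the function still computes the same result.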
Related
I use OpenMP in my service to parallelize a loop. Every time a request comes in, my service creates a brand new thread for it, and this thread uses OpenMP, which creates a thread pool. When will this thread pool be destroyed?
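#include <thread>
#include <vector>

void foo() {
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 100; i++) {  // example trip count
        // Do something
    }
}

int main() {
    const int x = 4;  // number of request threads (example value)
    std::vector<std::thread> threads;
    for (int i = 0; i < x; i++) {
        threads.push_back(std::thread(foo));
    }
    for (auto& thread : threads) {
        thread.join();
    }
}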
In this simplified code, I noticed that:
In the for loop, the thread count is 8 * x + 1 (an 8-core host, so 8 OpenMP threads for each std::thread, plus 1 main thread).
After the for loop, the thread count goes back to 1, which means all the OpenMP thread pools have been destroyed.
This can be reproduced with this simple code, but in more complex situations with similar use cases, I noticed that the thread pools were not destroyed after their parent thread finished, and I don't understand why.
So when does an OpenMP thread pool get destroyed?
The creation and deletion of the native threads of an OpenMP parallel region is left to the OpenMP implementation (e.g., IOMP for ICC/Clang, GOMP for GCC) and is not defined by the OpenMP specification. The specification does not require implementations to create native threads at the beginning of a parallel region, nor to delete them at the end. In fact, most implementations keep the threads alive as long as possible because creating threads is slow (especially on many-core architectures). The specification explicitly mentions the difference between native threads and basic OpenMP threads/tasks (everything is a task in OpenMP 5). Note that OMPT can be used to track when native threads are created and deleted. I expect mainstream implementations to create threads during runtime initialization (typically when the first parallel region is encountered) and to delete them when the program ends.
The specification states:
[A native thread is] a thread defined by an underlying thread implementation
If the parallel region creates a native thread, a native-thread-begin event occurs as the first event in the context of the new thread prior to the implicit-task-begin event.
If a native thread is destroyed at the end of a parallel region, a native-thread-end event occurs in the thread as the last event prior to destruction of the thread.
Note that implementations typically destroy and recreate new threads when the number of threads of a parallel region is different from the previous one. This also happens in pathological cases like nesting.
The documentation of GOMP is available here, but it is not very detailed. The IOMP documentation is available here and is not much better. You can find interesting information directly in the code of the runtimes, for example in the GOMP code. Note that there are useful comments like:
We only allow the reuse of idle threads for non-nested PARALLEL regions
I am trying to parallelize my C++ code using OpenMP.
So this is my first time with OpenMP and I have a couple of questions about how to use private / shared properly
Below is just a sample code I wrote to understand what is going on. Correct me if I am wrong.
#pragma omp parallel for
for (int x=0;x<100;x++)
{
for (int y=0;y<100;y++)
{
for (int z=0;z<100;z++)
{
a[x][y][z]=U[x]+U[y]+U[z];
}
}
}
So by using #pragma omp parallel for I can use multiple threads to run this loop, i.e., with 5 threads, thread #1 handles 0<=x<20, thread #2 handles 20<=x<40, ..., and thread #5 handles 80<=x<100.
And each thread runs at the same time. So by using this, I can make this code faster.
Since x, y, and z are declared inside the loop, they are private (each thread will have a copy version of these variables), a and U are shared.
So each thread reads a shared variable U and writes to a shared variable a.
I have a couple of questions.
What would be the difference between #pragma omp parallel for and #pragma omp parallel for private(y,z)? I think since x, y, and z are already private, they should be the same.
If I use #pragma omp parallel for private(a, U), does this mean each thread will have a copy of a and U?
For example, with 2 threads that have a copy of a and U, thread #1 use 0<=x<50 so that it writes from a[0][0][0] to a[49][99][99] and thread #2 writes from a[50][0][0] to a[99][99][99]. And after that they merge these two results so that they have complete version of a[x][y][z]?
Any variable declared within a parallel block will be private. Variables mentioned in the private clause of a parallel directive follow the normal rules for variables: the variable must already be declared at the point it is used.
The effect of private is to create a copy of the variable for each thread. The threads can then update their copy without worrying about changes made by other threads. At the end of the parallel block, the values are generally lost unless other clauses are included in the parallel directive. The reduction clause is the most common, as it can combine the results from each thread into a final result for the loop.
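To make the difference concrete, here is a minimal sketch (the function name sum_pairs is made up for illustration). The inner index y is declared outside the loop, so it has to be listed in private; had it been declared inside the loop, as in the question, the clause would be redundant. The accumulator uses reduction so that the per-thread copies are combined at the end. Listing a and U in private would instead give every thread its own uninitialized copy, and nothing would be merged back.

```cpp
#include <vector>

// Sums U[x] + U[y] over all index pairs (x, y).
double sum_pairs(const std::vector<double>& U)
{
    const int n = static_cast<int>(U.size());
    double total = 0.0;
    int y;  // declared outside the loop, so it is shared unless made private

    // x is the parallel loop index and is private automatically.
    // total gets one private copy per thread, combined with + at the end.
    #pragma omp parallel for private(y) reduction(+ : total)
    for (int x = 0; x < n; ++x)
        for (y = 0; y < n; ++y)
            total += U[x] + U[y];

    return total;
}
```

U is only read, so leaving it shared is safe and avoids copying it per thread.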
Near the start of my C++ application, my main thread uses OMP to parallelize several for loops. After the first parallelized for loop, I see that the threads used remain in existence for the duration of the application and are reused for subsequent OMP for loops executed from the main thread. I checked this with the following command (working in CentOS 7):
for i in $(pgrep myApplication); do ps -mo pid,tid,fname,user,psr -p $i;done
Later in my program, I launch a boost thread from the main thread, in which I parallelize a for loop using OMP. At this point, I see an entirely new set of threads are created, which has a decent amount of overhead.
Is it possible to make the OMP parallel for loop within the boost thread reuse the original OMP thread pool created by the main thread?
Edit: A simplified sketch of the code:
#include <boost/thread.hpp>

const int N = 1000;  // example size

void myFun(double* data)
{
    // Want to reuse OMP thread pool from main here.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        // Work on data
    }
}

int main()
{
    static double data[N];
    // Thread pool created here.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        // do stuff
    }
    boost::thread myThread(myFun, data); // Constructor starts thread.
    // Do some serial stuff, no OMP.
    myThread.join();
}
The interaction of OpenMP with other threading mechanisms is deliberately left out of the specification and therefore depends heavily on the implementation. The GNU OpenMP runtime keeps a pointer to the thread pool in TLS and propagates it down the (nested) teams. Threads started via pthread_create (or boost::thread or std::thread) do not inherit the pointer and therefore spawn a fresh pool. This is probably the case with other OpenMP runtimes too.
There is a requirement in the standard that basically forces such behaviour in most implementations. It is about the semantics of the threadprivate variables and how their values are retained across the different parallel regions forked from the same thread (OpenMP standard, 2.15.2 threadprivate Directive):
The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all of the following conditions hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The thread affinity policies used to execute both parallel regions are the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.
If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
This, besides performance, is probably the main reason for using thread pools in OpenMP runtimes.
Now, imagine that two parallel regions forked by two separate threads share the same worker thread pool. A parallel region was forked by the first thread and some threadprivate variables were set. Later a second parallel region is forked by the same thread, where those threadprivate variables are used. But somewhere between the two parallel regions, a parallel region is forked by the second thread and worker threads from the same pool are utilised. Since most implementations keep threadprivate variables in TLS, the above semantics can no longer be asserted. A possible solution would be to add new worker threads to the pool for each separate thread, which is not much different than creating new thread pools.
I'm not aware of any workarounds to make the worker thread pool shared. And even if one existed, it would not be portable, so the main benefit of OpenMP would be lost.
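The threadprivate guarantee quoted above can be illustrated with a small sketch. The variable name calls and the function name threadprivate_persists are hypothetical. Both regions are non-nested, use the default number of threads and the same affinity policy, and dyn-var is set to false, so each thread in the second region should still see the value it stored in the first one.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

static int calls = 0;              // one copy per thread once threadprivate
#pragma omp threadprivate(calls)

// Returns true if every thread in the second region still sees the
// value that the thread with the same number stored in the first region.
bool threadprivate_persists()
{
#ifdef _OPENMP
    omp_set_dynamic(0);            // dyn-var must be false for the guarantee
#endif

    #pragma omp parallel
    { calls = 1; }                 // each thread writes its own copy

    bool ok = true;
    #pragma omp parallel reduction(&& : ok)
    { ok = ok && (calls == 1); }   // each thread checks its own copy

    return ok;
}
```

If a second application thread ran parallel regions on the same workers between these two regions, the runtime could no longer keep this promise with plain TLS storage, which is the scenario described above.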
Does OMP ensure that the contents of a dynamically allocated array are up-to-date and visible to all threads after an OMP barrier?
Yes. A barrier causes all threads' view of all accessible memory to be made consistent; that is, it implicitly flushes the entire state of the program.
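A small sketch of what this means in practice (the function name fill_then_sum is made up): each thread writes part of a dynamically allocated array, all threads wait at an explicit barrier, and afterwards every thread may read elements written by any other thread.

```cpp
#include <vector>

// Fills a dynamically allocated array in parallel, then sums it in parallel.
long long fill_then_sum(int n)
{
    std::vector<int> storage(n);   // heap-allocated, shared by all threads
    int* data = storage.data();
    long long total = 0;

    #pragma omp parallel
    {
        // nowait removes the implicit barrier at the end of this loop ...
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            data[i] = i;

        // ... so this explicit barrier is what makes all writes visible.
        #pragma omp barrier

        // Reads elements in reverse order, so a thread typically reads
        // elements that a different thread wrote.
        #pragma omp for reduction(+ : total)
        for (int i = 0; i < n; ++i)
            total += data[n - 1 - i];
    }
    return total;
}
```

Without the nowait clause the explicit barrier would be redundant, because a worksharing loop already ends with an implicit barrier.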
If your array is declared outside the #pragma omp parallel construct, it is automatically accessible to and shared by all threads.
How the threads update it, however, depends only on your algorithm and on the synchronization mechanism you use to ensure correctness.
I have one function that I'm attempting to parallelize with OpenMP. It has a big for loop where every iteration is independent of the others, and I'd like to use something like
#pragma omp for private(j)
to parallelize the loop.
One problem is that each iteration of the loop requires a substantial amount of temporary workspace, enough that I think it will likely kill performance if I allocate and deallocate this temporary workspace once per iteration. My environment has "workspace" objects in it, and there is no problem with reusing an old workspace object as-is.
How can I allocate workspace for each thread before the threads are made (and I don't know how many of them there are)? How can I tell each thread to pick a unique workspace object from the pool?
You can use omp_get_max_threads() to allocate enough workspaces for all threads (e.g., an array of workspaces with omp_get_max_threads() elements), and then have each thread call omp_get_thread_num() to find out which thread it is running on, so it can pick its own workspace.
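A minimal sketch of this approach, assuming a stand-in Workspace type (the real workspace class from the question is not shown, so this one is hypothetical, as is the function name process):

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif
#include <vector>

// Hypothetical stand-in for the workspace type from the question.
struct Workspace {
    std::vector<double> scratch;
    Workspace() : scratch(1024) {}
};

// Computes the sum of j*j for j in [0, n), using one workspace per thread.
long long process(int n)
{
    // Allocated once, before the threads start: one workspace per
    // thread that the next parallel region could use.
    std::vector<Workspace> pool(omp_get_max_threads());
    long long total = 0;

    #pragma omp parallel for reduction(+ : total)
    for (int j = 0; j < n; ++j) {
        // Each thread indexes the pool with its own thread number.
        Workspace& ws = pool[omp_get_thread_num()];
        ws.scratch[0] = static_cast<double>(j) * j;
        total += static_cast<long long>(ws.scratch[0]);
    }
    return total;
}
```

This relies on the parallel region not using a num_threads clause larger than the value returned by omp_get_max_threads().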
Maybe I am missing the point, but wouldn't the following strategy work for you?
void foo() {
#pragma omp parallel
{
// allocate work-space here, so to make it private to the thread
thread_workspace t;
#pragma omp for
for(int j = 0; j < N; j++) {
// Each thread has its local work-space allocated outside the for loop
}
} // End of the parallel region
}
I recommend using the Object Pool design pattern. Here's a description. You would obviously need to make the acquire and release methods for the workspaces thread safe (3 methods in the ReusablePool need synchronization). The number of workspaces would grow to the total number needed at any one time. Reclaimed workspaces would be reused by the ReusablePool.
Although the object pool is handling the object instantiation it's
main purpose is to provide a way for the clients to reuse the objects
like they are new objects.
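A rough sketch of such a pool, assuming a placeholder Buffer type and a single mutex around the free list. The class and method names below (BufferPool, acquire, release, created) are illustrative only and are not taken from the linked description.

```cpp
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Placeholder for the real workspace type.
struct Buffer {
    std::vector<double> scratch = std::vector<double>(1024);
};

class BufferPool {
public:
    // Hands out a reclaimed buffer if one is free, otherwise creates one.
    std::unique_ptr<Buffer> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) {
            ++created_;
            return std::make_unique<Buffer>();
        }
        std::unique_ptr<Buffer> buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }

    // Returns a buffer to the pool so a later acquire() can reuse it.
    void release(std::unique_ptr<Buffer> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(buf));
    }

    // Number of buffers created so far.
    int created() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return created_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<Buffer>> free_;
    int created_ = 0;
};
```

Each loop iteration (or, better, each thread once per parallel region) would call acquire() before its work and release() afterwards. The pool grows to at most the number of buffers in use at the same time.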