When will omp thread pool get destructed - c++

I use OpenMP in my service to parallelize a loop. But every time a request comes in, my service creates a brand new thread for it, and this thread uses OpenMP to create a thread pool. Can I ask when this thread pool will be destructed?
void foo() {
#pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++) {
        // Do something
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < x; i++) {
        threads.push_back(std::thread(foo));
    }
    for (auto& thread : threads) {
        thread.join();
    }
}
In this pseudocode, I noticed that:
Inside the for loop, the thread count is 8 * x + 1 (on an 8-core host: 8 OpenMP threads for each std::thread, plus the main thread).
After the for loop, the thread count drops back to 1, which means all the OpenMP thread pools were destructed.
This is reproducible with this simple code, but in some more complex yet similar use cases, I noticed the thread pools were not destructed after their parent threads finished, and it is hard for me to understand why.
So can I ask when the thread pool of OpenMP will get destructed?

The creation and deletion of the native threads of an OpenMP parallel region is left to the OpenMP implementation (e.g. IOMP for ICC/Clang, GOMP for GCC) and is not defined by the OpenMP specification. The specification does not require implementations to create native threads at the beginning of a parallel region, nor to delete them at the end. In fact, most implementations keep the threads alive as long as possible because creating threads is slow (especially on many-core architectures). The specification explicitly distinguishes between native threads and OpenMP threads/tasks (everything is a task in OpenMP 5). Note that OMPT can be used to track when native threads are created and deleted. I expect mainstream implementations to create threads during runtime initialization (typically when the first parallel region is encountered) and to delete them when the program ends.
The specification states:
[A native thread is] a thread defined by an underlying thread implementation
If the parallel region creates a native thread, a native-thread-begin event occurs as the first event in the context of the new thread prior to the implicit-task-begin event.
If a native thread is destroyed at the end of a parallel region, a native-thread-end event occurs in the thread as the last event prior to destruction of the thread.
Note that implementations typically destroy and recreate threads when the number of threads of a parallel region differs from that of the previous one. This also happens in pathological cases like nesting.
The documentation of GOMP is available here, but it is not very detailed. The IOMP documentation is available here and is not much better... You can find interesting information directly in the code of the runtimes, for example in the GOMP code. Note that there are useful comments like:
We only allow the reuse of idle threads for non-nested PARALLEL regions
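A quick way to see this pooling in practice is to print a native thread identifier from two consecutive parallel regions. This is only an observation sketch; it assumes std::this_thread::get_id() identifies the underlying native thread, which holds for mainstream runtimes:

#include <cstdio>
#include <functional>
#include <thread>
#include <omp.h>

int main() {
    // First parallel region: the runtime typically spawns its pool here.
    #pragma omp parallel
    {
        std::printf("region 1, omp thread %d, native id %zu\n",
                    omp_get_thread_num(),
                    std::hash<std::thread::id>{}(std::this_thread::get_id()));
    }
    // Second region with the same thread count: with GOMP/IOMP the same
    // native ids usually reappear, i.e. the pool was kept alive.
    #pragma omp parallel
    {
        std::printf("region 2, omp thread %d, native id %zu\n",
                    omp_get_thread_num(),
                    std::hash<std::thread::id>{}(std::this_thread::get_id()));
    }
}

On GOMP and IOMP you will typically see the same native ids in both regions, confirming that the pool outlived the first region.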

Related

omp global memory fence / barrier

Does OpenMP with target offloading on the GPU include a global memory fence / global barrier, similar to OpenCL?
barrier(CLK_GLOBAL_MEM_FENCE);
I've tried using a barrier inside a teams construct:
#pragma omp target teams
{
    // Some initialization...
    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some work...
    }
    #pragma omp barrier
    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some other work depending on the previous loop
    }
}
However, it seems that the barrier only works within a team, equivalent to:
barrier(CLK_LOCAL_MEM_FENCE);
I would like to avoid splitting the kernel into two, to avoid sending team-local data to global memory just to load it again.
Edit: I've been able to enforce the desired behavior using a global atomic counter and busy waiting in the teams. However, this doesn't seem like a good solution, and I'm still wondering if there is a better way to do this with proper OpenMP constructs.
A barrier construct only synchronizes threads in the current team. Synchronization between threads from different thread teams launched by a teams construct is not available. OpenMP's execution model doesn't guarantee that such threads will even execute concurrently, so using atomic constructs to synchronize between the threads will not work in general:
Whether the initial threads concurrently execute the teams region is
unspecified, and a program that relies on their concurrent execution for the
purposes of synchronization may deadlock.
Note that the OpenCL barrier call only provides synchronization within a workgroup, even with the CLK_GLOBAL_MEM_FENCE argument. See Barriers in OpenCL for more information on semantics of CLK_GLOBAL_MEM_FENCE versus CLK_LOCAL_MEM_FENCE.
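For reference, the portable way to get the effect of a device-wide barrier is to end the first kernel and launch a second one; kernel completion is the synchronization point. The sketch below (data and N are placeholder names) wraps both kernels in a target data region so the buffer stays resident on the device. Note that it still forces team-local intermediates through global device memory, which is exactly the cost the question hoped to avoid:

// Keep the buffer resident on the device across both kernels so the
// split does not force a round trip to the host.
#pragma omp target data map(tofrom: data[0:N])
{
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // first phase
    }
    // The end of the first target region is a device-wide completion
    // point: every team has finished before the next kernel launches.
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // second phase
    }
}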

Thread ID reuse between std::thread and tbb::task_group causing deadlock in OpenMP

*** UPDATE: changing code to a real case that reproduces the problem ***
I'm working on some preexisting code that uses a number of multi-threading techniques: std::thread, plus Intel TBB task_group, plus OpenMP. 🙄 It looks like I've hit a race condition in std::thread join, and potentially one in OpenMP as well. (But of course those libraries were written by smart people, so if there's a bug in the code I'm working with, I hope you can help me figure it out.)
The scenario is that the main thread kicks off a bunch of I/O worker std::threads, which themselves initiate some tasks, and the tasks have some segments of code that use OpenMP for parallelism. The main thread does std::thread::join() to wait for the std::threads, then tbb::TaskGroup::wait() to wait for the tasks to complete.
#include <Windows.h>
#include <tbb/task_group.h>
#include <tbb/concurrent_vector.h>
#include <iostream>
#include <memory>
#include <sstream>
#include <thread>

void DoCPUIntensiveWork(int chunkIndex);

int main()
{
    unsigned int hardwareConcurrency = 64;
    tbb::concurrent_vector<std::shared_ptr<std::thread>> ioThreads;
    tbb::task_group taskGroup;

    wprintf(L"Starting %u IO threads\n", hardwareConcurrency);
    for (unsigned int cx = 0; cx < hardwareConcurrency; ++cx)
    {
        ioThreads.push_back(std::shared_ptr<std::thread>(new std::thread([&taskGroup, cx]
        {
            wprintf(L"IO thread %u starting\r\n", GetCurrentThreadId());
            // Not doing any actual IO
            taskGroup.run([cx]
            {
                wprintf(L"CPU task %u starting\r\n", GetCurrentThreadId());
                DoCPUIntensiveWork(cx);
                wprintf(L"CPU task %u done\r\n", GetCurrentThreadId());
            });
            //Sleep(1000); Un-commenting this will make the program terminate
            wprintf(L"IO thread %u done\r\n", GetCurrentThreadId());
        })));
    }

    // Join the IO workers
    for (std::shared_ptr<std::thread>& thread : ioThreads)
    {
        std::stringstream ss;
        ss << thread->get_id();
        wprintf(L"Wait for thread %S\r\n", ss.str().c_str());
        thread->join(); // main thread hangs here
    }
    wprintf(L"IO work complete\n");

    // And then wait for the CPU tasks
    taskGroup.wait();
    wprintf(L"CPU work complete\n");
    return 0;
}
And the CPU-Intensive work includes usage of OpenMP. (Note, result is the same if I remove the schedule(static).)
// Note: I shrunk these numbers down until the amount of work is actually
// small, not CPU-intensive at all, and it still hangs
static const int GlobalBufferChunkSize = 64;
static const int NumGlobalBufferChunks = 64;
static const int StrideSize = 16;
static const int OverwriteCount = 4;
BYTE GlobalBuffer[NumGlobalBufferChunks * GlobalBufferChunkSize];

void DoCPUIntensiveWork(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (GlobalBufferChunkSize / StrideSize); i++)
    {
        BYTE* pStride = pChunk + (i * StrideSize);
        for (int j = 0; j < OverwriteCount; j++)
        {
            memset(pStride, i, StrideSize);
        }
    } // Task thread hangs here
}
This code hangs; the main thread waits on thread->join() forever. Even on a test case that has only a single IO job / CPU-intensive task. I added the printf's you see above, and the result showed that the IO job finished fast, that thread exited, and then the CPU-intensive task spun up with the same thread ID before the main thread even got into the join() call.
Starting 64 IO threads
IO thread 3708 starting
IO thread 23728 starting
IO thread 23352 starting
IO thread 3588 starting
IO thread 3708 done
IO thread 23352 done
IO thread 22080 starting
IO thread 23728 done
IO thread 3376 starting
IO thread 3588 done
IO thread 27436 starting
IO thread 10092 starting
IO thread 22080 done
IO thread 10480 starting
CPU task 3708 starting
IO thread 3376 done
IO thread 27436 done
IO thread 10092 done
IO thread 10480 done
Wait for thread 3708
... hang forever ...
The IO thread ID was reused for a task after the thread finished, and the thread->join() call is still sitting there waiting. When I look in the debugger, thread->join() is waiting on a thread with ID 3708, and a thread with that ID does exist, but that thread is executing the task instead of the IO work. So it appears the primary thread of the process is actually waiting for the task instead of waiting for the IO thread due to the ID reuse. (I can't find docs or code to see if std::thread::join() waits based on the ID or the handle, but it seems like it uses the ID, which would be a bug.)
Second funny thing, that task never completes, and when I look at the thread that's executing the task in the debugger, it's sitting at the end of the OpenMP parallel execution. I don't see any other threads executing parallel work. There are a number of threads from vcomp140[d].dll sitting around in ntdll.dll code, for which I don't have symbols - I presume these are just waiting for new work, not doing my task. The CPU is at 0%. I'm pretty confident nobody is looping. So, the TBB task is hung somewhere in the OpenMP multi-threading implementation.
But, just to make life complicated, the task doesn't seem to hang UNLESS the thread ID from the IO thread happens to be reused for the task. So, somewhere between std::thread and TBB tasks and OpenMP parallelism there's a race condition triggered by thread ID reuse.
I have found two workarounds that make the hang go away:
Put a Sleep(1000) at the end of the IO thread, so IO thread IDs aren't reused by the tasks. The bug is still there waiting for bad timing, of course.
Remove the use of OpenMP parallelism.
A co-worker has suggested a third potential option, to replace OpenMP parallelism with TBB parallel_for. We may do that. Of course this is all layers of code from different sources that we want to touch as little as possible. 🙄
I'm reporting this more as a possible bug report than as a cry for help.
It seems like a bug that std::thread::join() can end up waiting for the wrong thread if a thread ID is reused. It should be waiting by handle, not by ID.
It seems like there's a bug or incompatibility between TBB tasks and OpenMP, such that the OpenMP main thread can hang if it's run on a TBB task that happens to have a thread ID that was reused.
Moving this issue to the TBB GitHub, at https://github.com/oneapi-src/oneTBB/issues/353
UPDATE: the supposition about oversubscription was incorrect. See https://github.com/oneapi-src/oneTBB/issues/353
I think the issue might be caused by OpenMP semantics: by default, it always creates as many threads as the hardware concurrency.
TBB will create std::thread::hardware_concurrency() threads, and OpenMP will create std::thread::hardware_concurrency() threads for each TBB worker thread it is called from. I.e. in the example we will have up to std::thread::hardware_concurrency() * std::thread::hardware_concurrency() threads (+ 64 IO threads). On a relatively big machine, e.g. 32+ hardware threads, that is 32*32 = 1024 threads in the application overall (near a default ulimit on Linux; is this Windows?). In any case, such heavy oversubscription, combined with the OpenMP barrier semantics at the end of a parallel region, can lead to really long execution times (e.g. minutes or even hours).
Why does Sleep(1000) help? I am not sure but it might give some CPU resources to the system to move forward.
To check the idea, add a num_threads(1) clause, i.e. #pragma omp parallel for num_threads(1), to limit the number of threads created by the OpenMP runtime.
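Applied to the repro above, the check would look like this (a diagnostic sketch only: if the hang disappears with the clause, oversubscription was the likely culprit):

void DoCPUIntensiveWork(int chunkIndex)
{
    BYTE* pChunk = GlobalBuffer + (chunkIndex * GlobalBufferChunkSize);
    // num_threads(1) keeps the OpenMP runtime from spawning a pool per
    // TBB worker; the loop body is unchanged.
    #pragma omp parallel for schedule(static) num_threads(1)
    for (int i = 0; i < (GlobalBufferChunkSize / StrideSize); i++)
    {
        BYTE* pStride = pChunk + (i * StrideSize);
        for (int j = 0; j < OverwriteCount; j++)
        {
            memset(pStride, i, StrideSize);
        }
    }
}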

C++ condition variables vs new threads for vectorization

I have a block of code that goes through a loop. A section of the code operates on a vector of data, and I would like to vectorize this operation. The idea is to split the elaboration of the array across multiple threads that each work on a subsection of the array. I have to decide between two possibilities. The first one is to create the threads each time this section is encountered and rejoin them with the main thread at the end:
for(....)
{
    //serial stuff

    //create threads
    for(i = 0; i < num_threads; ++i)
    {
        threads_vect.push_back(std::thread(f, sub_array[i]));
    }
    //join them
    for(auto& t : threads_vect)
    {
        t.join();
    }

    //serial stuff
}
This is similar to what is done with OpenMP, but since the problem is simple I'd like to use std::thread directly instead of OpenMP (unless there are good reasons against this).
The second option is to create the threads beforehand to avoid the overhead of creation and destruction, and to use condition variables for synchronization (I omitted a lot of the synchronization details; this is just the general idea):
std::condition_variable cv_threads;
std::condition_variable cv_main;
std::mutex main_mutex;
//create threads; they will sleep on cv_threads
for(....)
{
    //serial stuff

    //wake up threads
    cv_threads.notify_all();
    //sleep until the last thread finishes; that thread will notify
    std::unique_lock<std::mutex> main_lock(main_mutex);
    cv_main.wait(main_lock);

    //serial stuff
}
To allow for parallelism, the threads will have to unlock the thread lock as soon as they wake up at the beginning of the computation, then acquire it again to go back to sleep, and synchronize among themselves to notify the main thread.
My question is which of these solutions is usually preferred in a context like this, and whether the avoided overhead of thread creation and destruction is worth the added complexity (or worth anything at all, given that the added synchronization also costs time).
Obviously this also depends on how long the computation is for each thread, but it could vary a lot, since the data vector could be very short (down to about two elements per thread, which would lead to a computation time of about 15 milliseconds).
The biggest disadvantage of creating new threads is that thread creation and shutdown is usually quite expensive. Just think of all the things an operating system has to do to get a thread off the ground, compared to what it takes to notify a condition variable.
Note that synchronization is always required, even with thread creation. The C++11 std::thread, for instance, introduces a synchronizes-with relationship with the creating thread upon construction. So you can safely assume that thread creation will always be significantly more expensive than condition variable signalling, regardless of your implementation.
A framework like OpenMP will typically attempt to amortize these costs in some way. For instance, an OpenMP implementation is not required to shut down the worker threads after every loop and many implementations will not do this.
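For reference, here is a minimal sketch of the second option with the omitted synchronization filled in. The names (worker, run_iteration) are made up for illustration, and it handles a single persistent worker only; a real version would track how many workers have finished:

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool work_ready = false; // guards against spurious and lost wakeups
bool work_done  = false;
bool stop       = false; // set under m at shutdown, then notify_all

void worker() {
    std::unique_lock<std::mutex> lock(m);
    while (true) {
        cv.wait(lock, [] { return work_ready || stop; });
        if (stop) return;
        work_ready = false;
        lock.unlock();
        // ... process this thread's sub-array ...
        lock.lock();
        work_done = true;
        cv.notify_all(); // wake the main thread
    }
}

void run_iteration() { // called by the main loop each iteration
    {
        std::lock_guard<std::mutex> lk(m);
        work_ready = true;
        work_done = false;
    }
    cv.notify_all();
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return work_done; });
}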

How to Reuse OMP Thread Pool, Created by Main Thread, in Worker Thread?

Near the start of my C++ application, my main thread uses OMP to parallelize several for loops. After the first parallelized for loop, I see that the threads used remain in existence for the duration of the application and are reused for subsequent OMP for loops executed from the main thread. I observe this with the command (working on CentOS 7):
for i in $(pgrep myApplication); do ps -mo pid,tid,fname,user,psr -p $i;done
Later in my program, I launch a boost thread from the main thread, in which I parallelize a for loop using OMP. At this point, I see an entirely new set of threads are created, which has a decent amount of overhead.
Is it possible to make the OMP parallel for loop within the boost thread reuse the original OMP thread pool created by the main thread?
Edit: Some pseudo code:
myFun(data)
{
    // Want to reuse the OMP thread pool from main here.
    #pragma omp parallel for
    for(int i = 0; i < N; ++i)
    {
        // Work on data
    }
}

main
{
    // Thread pool created here.
    #pragma omp parallel for
    for(int i = 0; i < N; ++i)
    {
        // do stuff
    }

    boost::thread myThread(myFun); // Constructor starts the thread.
    // Do some serial stuff, no OMP.
    myThread.join();
}
The interaction of OpenMP with other threading mechanisms is deliberately left out of the specification and is therefore dependent heavily on the implementation. The GNU OpenMP runtime keeps a pointer to the thread pool in TLS and propagates it down the (nested) teams. Threads started via pthread_create (or boost::thread or std::thread) do not inherit the pointer and therefore spawn a fresh pool. It is probably the case with other OpenMP runtimes too.
There is a requirement in the standard that basically forces such behaviour in most implementations. It is about the semantics of the threadprivate variables and how their values are retained across the different parallel regions forked from the same thread (OpenMP standard, 2.15.2 threadprivate Directive):
The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all of the following conditions hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The thread affinity policies used to execute both parallel regions are the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.
If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
This, besides performance, is probably the main reason for using thread pools in OpenMP runtimes.
Now, imagine that two parallel regions forked by two separate threads share the same worker thread pool. A parallel region was forked by the first thread and some threadprivate variables were set. Later a second parallel region is forked by the same thread, where those threadprivate variables are used. But somewhere between the two parallel regions, a parallel region is forked by the second thread and worker threads from the same pool are utilised. Since most implementations keep threadprivate variables in TLS, the above semantics can no longer be asserted. A possible solution would be to add new worker threads to the pool for each separate thread, which is not much different than creating new thread pools.
I'm not aware of any workarounds to make the worker thread pool shared. And if possible, it will not be portable, therefore the main benefit of OpenMP will be lost.
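The per-thread pooling is easy to observe by timing an empty parallel region from the main thread and then from a fresh std::thread; this is only a measurement sketch (absolute numbers depend on the runtime and machine):

#include <chrono>
#include <cstdio>
#include <thread>

static double region_ms() {
    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel
    {
        // empty region: the cost measured is fork/join only
    }
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("main, 1st region: %.3f ms\n", region_ms()); // pays pool creation
    std::printf("main, 2nd region: %.3f ms\n", region_ms()); // reuses the pool
    std::thread t([] {
        // Fresh native thread: with GOMP the TLS pool pointer is null here,
        // so the first region pays the creation cost again.
        std::printf("worker, 1st region: %.3f ms\n", region_ms());
        std::printf("worker, 2nd region: %.3f ms\n", region_ms());
    });
    t.join();
}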

C++11 Dynamic Threadpool

Recently, I've been trying to find a library for threading concurrent tasks. Ideally, I want a simple interface that calls a function on a thread. There are n threads at any time, some completing faster than others and arriving at different times.
First I was trying Rx, which is great in C++. I've also looked into Blocks and TBB, but they are platform dependent. For my prototype, I need to remain platform independent, as we don't know what it will be running on yet; that can change when decisions are made.
C++11 has a number of things for threading and concurrency, and I found a number of examples like this one for thread pools.
https://github.com/bilash/threadpool
Similar projects use the same lambda expressions with std::thread and std::mutex.
This looks perfect for what I need, but there are some issues. The pools are started with a defined number of threads, and tasks are queued until a thread is free.
How can I add new threads?
Remove expired threads? (join()?)
Obviously, this is much easier with a known number of threads, as they can be initialised in the ctor and then joined in the dtor.
Any tips or pointers from someone with experience in C++ concurrency?
Start with the maximum number of threads a system can support:
int Num_Threads = thread::hardware_concurrency();
For an efficient threadpool implementation, once threads are created according to Num_Threads, it's better not to create new ones or destroy old ones (by joining). There would be a performance penalty; it might even make your application slower than the serial version.
Each C++11 thread should run a function with an infinite loop, constantly waiting for new tasks to grab and run.
Here is how to attach such function to the thread pool:
int Num_Threads = thread::hardware_concurrency();
vector<thread> Pool;
for(int ii = 0; ii < Num_Threads; ii++)
{
    Pool.push_back(thread(Infinite_loop_function));
}
The Infinite_loop_function
This is a "while(true)" loop waiting for the task queue
void The_Pool::Infinite_loop_function()
{
    while(true)
    {
        function<void()> Job;
        {
            unique_lock<mutex> lock(Queue_Mutex);
            condition.wait(lock, [this]{ return !Queue.empty(); });
            Job = Queue.front();
            Queue.pop();
        }
        Job(); // function<void()> type
    }
}
Make a function to add jobs to your Queue:
void The_Pool::Add_Job(function<void()> New_Job)
{
    {
        unique_lock<mutex> lock(Queue_Mutex);
        Queue.push(New_Job);
    }
    condition.notify_one();
}
Bind an arbitrary function to your Queue
Pool_Obj.Add_Job(std::bind(&Some_Class::Some_Method, &Some_object));
Once you integrate these ingredients, you have your own dynamic thread pool. These threads always run, waiting for jobs to do.
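On the two open questions (adding and joining threads): a stop flag checked inside the wait predicate lets the infinite loops exit, so the destructor can join all workers. Below is a condensed sketch tying the fragments above together; the member names mirror the snippets, and the terminate flag is an addition of this sketch:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class The_Pool {
    std::vector<std::thread> Pool;
    std::queue<std::function<void()>> Queue;
    std::mutex Queue_Mutex;
    std::condition_variable condition;
    bool terminate = false;
public:
    The_Pool() {
        unsigned n = std::thread::hardware_concurrency();
        for (unsigned i = 0; i < n; ++i)
            Pool.emplace_back(&The_Pool::Infinite_loop_function, this);
    }
    void Add_Job(std::function<void()> New_Job) {
        {
            std::unique_lock<std::mutex> lock(Queue_Mutex);
            Queue.push(std::move(New_Job));
        }
        condition.notify_one();
    }
    ~The_Pool() {
        {
            std::unique_lock<std::mutex> lock(Queue_Mutex);
            terminate = true;          // ask the workers to exit...
        }
        condition.notify_all();
        for (auto& t : Pool) t.join(); // ...then join them
    }
private:
    void Infinite_loop_function() {
        while (true) {
            std::function<void()> Job;
            {
                std::unique_lock<std::mutex> lock(Queue_Mutex);
                condition.wait(lock,
                    [this] { return terminate || !Queue.empty(); });
                if (terminate && Queue.empty()) return; // drain, then exit
                Job = std::move(Queue.front());
                Queue.pop();
            }
            Job();
        }
    }
};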
This should be simple to use: https://pocoproject.org/docs/Poco.ThreadPool.html
A thread pool always keeps a number of threads running, ready to accept work. Creating and starting a thread can impose a significant runtime overhead to an application. A thread pool helps to improve the performance of an application by reducing the number of threads that have to be created (and destroyed again). Threads in a thread pool are re-used once they become available again. The thread pool always keeps a minimum number of threads running. If the demand for threads increases, additional threads are created. Once the demand for threads sinks again, no-longer used threads are stopped and removed from the pool.
ThreadPool(
    int minCapacity = 2,
    int maxCapacity = 16,
    int idleTime = 60,
    int stackSize = 0
);
This is a very nice library and easy to use, not like Boost :(
https://github.com/pocoproject/poco
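A minimal usage sketch for the Poco pool (assuming Poco is installed; the capacities shown mirror the constructor defaults above):

#include <Poco/Runnable.h>
#include <Poco/ThreadPool.h>
#include <iostream>

// A task is anything deriving from Poco::Runnable.
class HelloTask : public Poco::Runnable {
public:
    void run() override { std::cout << "hello from the pool\n"; }
};

int main() {
    HelloTask task;
    Poco::ThreadPool pool(2 /*minCapacity*/, 16 /*maxCapacity*/);
    pool.start(task); // throws NoThreadAvailableException if exhausted
    pool.joinAll();   // wait for all pooled threads to finish
}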