How can I write code that reuses threads with OpenMP in C++?

I have a function f that can benefit from parallel processing, so I used OpenMP.
However, this function is called many times, and it seems that threads are created on every call.
How can I reuse the threads?
void f(X &src, Y &dest) {
... // do processing based on "src"
#pragma omp parallel for
for (...) {
}
...// put output into "dest"
}
int main() {
...
for(...) { // It is impossible to make this loop processing parallel one.
f(...);
}
...
return 0;
}

OpenMP runtimes typically implement a thread pool internally and try to reuse threads, unless you change some settings in between or call parallel regions from different application threads while others are still active.
You can verify that the threads are indeed the same by using thread-local variables. I'd recommend verifying your claim that the threads are being recreated. The OpenMP runtime does many smart optimizations beyond the obvious thread-pool idea; you just need to know how to tune and control it properly.
While it is unlikely that threads are recreated, it is easy to see how threads can go to sleep by the time you enter a parallel region again, and waking them up takes a noticeable amount of time. You can prevent threads from going to sleep by setting OMP_WAIT_POLICY=active and/or implementation-specific environment variables such as KMP_BLOCKTIME=infinite (for the Intel/LLVM runtimes).
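One way to run the check suggested above is to record which OS threads take part in many consecutive parallel regions. This is only a minimal sketch: the helper name distinct_threads is made up for illustration, and it uses std::this_thread::get_id() rather than anything OpenMP-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <thread>

// Hypothetical helper: enters `regions` parallel regions back to back and
// returns how many distinct OS threads took part in total.
std::size_t distinct_threads(int regions) {
    assert(regions > 0);
    std::set<std::thread::id> seen;
    for (int r = 0; r < regions; ++r) {
        #pragma omp parallel
        {
            // std::set is not thread-safe, so serialize the insert.
            #pragma omp critical
            seen.insert(std::this_thread::get_id());
        }
    }
    return seen.size();
}
```

If the returned count stays at the team size no matter how many regions you run, the runtime is reusing its threads; a count that grows with the number of regions would indicate that threads are being recreated.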

This is just in addition to Anton's correct answer. If you are really concerned about the issue, for most programs you can easily move the parallel region to the outside and keep serial work serial, as follows:
void f(X &src, Y &dest) {
// You can also do simple computations
// without side effects outside of the single section
#pragma omp single
{
... // do processing based on "src"
}
#pragma omp for // note parallel is missing
for (...) {
}
#pragma omp critical
...// each thread puts its own part of the output into "dest"
}
int main() {
...
// make sure to declare loop variable locally or explicitly private
#pragma omp parallel
for(type variable;...;...) {
f(...);
}
...
return 0;
}
Use this only if you have measured evidence that you are suffering from the overhead of reopening parallel regions. You may have to juggle with shared variables, or manually inline f, because all variables declared within f will be private - so how it looks in detail depends on your specific application.

Related

OpenMP parallelization inside for loops takes too long

I am preparing a program which must use OpenMP parallelization. The program is supposed to compare two frames, inside of which both frames must be compared block by block, and OpenMP must be applied in two ways: one where frame work must be split across threads, and the other where the work must be split between the threads by a block level, finding the minimum cost of each comparison.
The main idea behind the skeleton of the code would look as follows:
int main() {
// code
for () {
for () {
searchBlocks();
}
}
// code
}
searchBlocks() {
for () {
for () {
getCost()
}
}
}
getCost() {
for () {
for () {
// operations
}
}
}
Then, considering parallelization at a frame level, I can simply change the main nested loop to this
int main() {
// code
omp_set_num_threads(threadNo);
#pragma omp parallel for collapse(2) if (isFrame)
for () {
for () {
searchBlocks();
}
}
// code
}
Where threadNo is specified upon running and isFrame is obtained via a parameter to specify if frame level parallelization is needed. This works and the execution time of the program becomes shorter as the number of threads used becomes bigger. However, as I try block level parallelization, I attempted the following:
getCost() {
#pragma omp parallel for collapse(2) if (isFrame)
for () {
for () {
// operations
}
}
}
I'm doing this in getCost() considering that it is the innermost function where the comparison of each corresponding block happens, but as I do this the program takes really long to execute, so much so that if I were to run it without OpenMP support (so 1 single thread) against OpenMP support with 10 threads, the former would finish first.
Is there something that I'm not declaring right here? I'm setting the number of threads right before the nested loops of the main function, just like I had in frame level parallelization.
Please let me know if I need to explain this better, or what it is I could change in order to manage to run this parallelization successfully, and thanks to anyone who may provide help.
Remember that every time your program executes a #pragma omp parallel directive, it forks a team of threads. Even if the runtime reuses pooled threads, forking and joining a team is costly, and since you call getCost() many, many times and each call is not that computationally heavy, you end up wasting most of the time on starting and joining teams.
On the other hand, when a #pragma omp for directive is executed, it doesn't spawn any threads; it lets the existing threads (started by the enclosing parallel directive) execute separate pieces of the loop in parallel.
So what you want is to spawn the threads at the top level of your computation (notice there is no for):
int main() {
// code
omp_set_num_threads(threadNo);
#pragma omp parallel
for () {
for () {
searchBlocks();
}
}
// code
}
and then split the loops later (notice there is no parallel):
getCost() {
#pragma omp for collapse(2) // the if clause is not accepted on a bare "omp for"
for () {
for () {
// operations
}
}
}
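A compilable version of this pattern might look like the sketch below. The names scale and scale_repeatedly are invented for illustration and stand in for getCost() and the main loop; the loop body is a placeholder for the real work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Orphaned worksharing loop: no "parallel" here, so it binds to whatever
// parallel region the caller is in (or runs serially if there is none).
void scale(std::vector<double>& v, double factor) {
    #pragma omp for
    for (long i = 0; i < static_cast<long>(v.size()); ++i) {
        v[i] *= factor;
    }   // implicit barrier: all threads finish before anyone returns
}

// Hypothetical driver: opens the parallel region once, then every thread
// runs the outer loop and shares the iterations of the inner loop.
void scale_repeatedly(std::vector<double>& v, double factor, int reps) {
    assert(reps >= 0);
    #pragma omp parallel
    for (int r = 0; r < reps; ++r) {
        scale(v, factor);
    }
}
```

Note that every thread executes the outer loop in full, so anything in it that must happen only once (I/O, updates to shared state) needs a single or critical construct around it.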
You get cascading (nested) parallelization. Take the limit values of the main loops as I and J, and of the getCost loops as K and L: if nested parallelism is enabled, each of the I * J outer iterations opens another parallel region for its K * L inner iterations, so the number of threads can multiply far beyond the number of cores. Any operating system will struggle with that; it is not far from a fork bomb.
Also, it is not clear why "collapse" is used. There are still two loops inside, and I don't see much point in this clause.
Try removing the parallelism in getCost().

OpenMP: how to realize thread local object in task?

What I am trying to do is to iterate over all elements of a container in parallel comparable to #pragma omp for; however the container in question does not offer a random access iterator. I therefore use a workaround via tasks described in this stackoverflow answer:
for (std::map<A,B>::iterator it = my_map.begin();
it != my_map.end();
++it)
{
#pragma omp task
{ /* do work with it */ }
}
My problem is that a 'scratch space' object is needed in each iteration; this object is expensive to construct or copy into the data environment of the task. It would only be necessary to have a single thread local object for each thread; in the sense that each task uses the object of the thread it is executed on. private requires a copy and shared results in a race condition. Is there a way to realize this with OpenMP?
I researched #pragma omp threadprivate, however the objects are not static, as the structure of the program looks something like this:
method(int argument_for_scratch_object){
#pragma omp parallel
{
Object scratch(argument_for_scratch_object);
//some computations are done here...
#pragma omp single nowait
{
//here goes the for loop creating the tasks above
//each task uses the scratch space object
}
}
}
If scratch was declared static (and then made threadprivate) before the parallel region, it would be initialized with argument_for_scratch_object of the first method call; which might not be correct for the subsequent method calls.
According to your update, I would suggest using a global/static thread-private pointer and initializing it in each thread within your parallel region.
static Object* scratch_ptr;
#pragma omp threadprivate(scratch_ptr)
void method(int argument_for_scratch_object)
{
#pragma omp parallel
{
scratch_ptr = new Object(argument_for_scratch_object);
...
delete scratch_ptr;
}
}
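Combined with the task loop from the question, this could look roughly like the sketch below. Scratch and sum_values are hypothetical stand-ins for the real scratch object and method. The sketch assumes tied tasks (the default) whose bodies contain no task scheduling points, so a task always sees the scratch object of the thread that runs it. It also uses single without nowait, so that the barrier at the end of single guarantees all tasks are done before any thread deletes its scratch object.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for the expensive scratch object.
struct Scratch {
    std::vector<int> buffer;
    explicit Scratch(int size) : buffer(size) {}
};

static Scratch* scratch_ptr = nullptr;
#pragma omp threadprivate(scratch_ptr)

long sum_values(const std::map<int, int>& items, int scratch_size) {
    assert(scratch_size > 0);
    long total = 0;
    #pragma omp parallel
    {
        scratch_ptr = new Scratch(scratch_size);  // once per thread
        #pragma omp single
        {
            for (auto it = items.begin(); it != items.end(); ++it) {
                #pragma omp task firstprivate(it) shared(total)
                {
                    Scratch& scratch = *scratch_ptr;  // executing thread's copy
                    scratch.buffer[0] = it->second;   // placeholder work
                    #pragma omp atomic
                    total += scratch.buffer[0];
                }
            }
        }   // implicit barrier: all tasks have finished here
        delete scratch_ptr;
        scratch_ptr = nullptr;
    }
    return total;
}
```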

Forking once and then use the threads in a different procedure?

If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?
My guess/hope is that existing threads are used, as they all should have nothing to do at the moment.
Pseudo-Example:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
#pragma omp single
{
for (int t=0; t<1000; t++) {
evolve();
}
}
}
}
void evolve() {
#pragma omp parallel for num_threads(2)
for (int i=0; i<100; i++) {
do_stuff(i);
}
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
As evolve() is called very often, forking here would cause way too much overhead, so I would like to do it only once, then call evolve() from a single thread and split the work of the calls to do_stuff() over the existing threads.
For Fortran this seems to work. I get a roughly 80-90% speed increase on a simple example using 2 threads. But for C++ I get a different behavior: only the thread which executes the single directive is used for the loop in evolve().
I fixed the problem using the task directive in the main program and passing the limits to evolve(), but this looks like a clumsy solution...
Why is the behavior in Fortran and C++ different and what would be the solution in C++?
I believe orphaned directives are the cleanest solution in your case:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
// Each thread calls evolve() a thousand times
for (int t=0; t<1000; t++) {
evolve();
}
}
}
void evolve() {
// The orphaned construct inside evolve()
// will bind to the innermost parallel region
#pragma omp for
for (int i=0; i<100; i++) {
do_stuff(i);
} // Implicit thread synchronization
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
This will work because (section 2.6.1 of the standard):
A loop region binds to the innermost enclosing parallel region
That said, in your code you are using nested parallel constructs. To be sure to enable them you must set the environment variable OMP_NESTED to true, otherwise (quoting Appendix E of the latest standard):
OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:
void foo() {
#pragma omp parallel
#pragma omp single
bar();
}
void bar() {
#pragma omp parallel
printf("...");
}
OpenMP is requested to create a new team of threads when entering the parallel region in bar. OpenMP calls that "nested parallelism". However, what exactly happens depends on the OpenMP implementation you actually use and on the setting of OMP_NESTED.
OpenMP implementations are not required to support nested parallelism. It would be perfectly legal for an implementation to ignore the parallel region in bar and just execute it with one thread. OMP_NESTED can be used to turn nesting on and off, if the implementation supports it.
In your case, things by chance went well, since you sent all threads to sleep except one. This thread then created a new team of threads of full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you would easily get thousands of threads.
Unfortunately, OpenMP does not support your pattern to create a parallel team, have one thread executing the call stacks, and then distribute work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution will be OpenMP tasks.
Cheers,
-michael
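For reference, the task-based variant mentioned above can be written fairly compactly with taskloop, assuming a compiler that supports OpenMP 4.5 or later. This is only a sketch: run is a made-up driver name, the array is passed as a pointer plus length rather than as the global A, and the loop body is a placeholder for do_stuff(i).

```cpp
#include <cassert>

// Called by one thread only; taskloop chops the loop into tasks that any
// idle thread of the enclosing team may pick up.
void evolve(double* a, int n) {
    #pragma omp taskloop
    for (int i = 0; i < n; ++i) {
        a[i] += 1.0;  // placeholder for the expensive do_stuff(i)
    }   // taskloop waits here for its tasks (implicit taskgroup)
}

// Hypothetical driver: one team, one thread walking the time loop.
void run(double* a, int n, int steps) {
    assert(n >= 0 && steps >= 0);
    #pragma omp parallel
    #pragma omp single
    for (int t = 0; t < steps; ++t) {
        evolve(a, n);
    }
}
```

The threads that do not execute the single construct wait at its implicit barrier, which is a task scheduling point, so they are available to execute the tasks generated inside evolve().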
Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). However, if that really is what you meant, I suspect that most OpenMP implementations will not work correctly in a forked process. Typically, threads are not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly following a fork() call, but it will not use the same threads as the parent process.

c++ OpenMP critical: "one-way" locking?

Consider the following serial function. When I parallelize my code, every thread will call this function from within the parallel region (not shown). I am trying to make this threadsafe and efficient (fast).
float get_stored_value__or__calculate_if_does_not_yet_exist( int A )
{
static std::map<int, float> my_map;
std::map<int, float>::iterator it_find = my_map.find(A); //many threads do this often.
bool found_A = it_find != my_map.end();
if (found_A)
{
return it_find->second;
}
else
{
float result_for_A = calculate_value(A); //should only be done once, really.
my_map[A] = result_for_A;
return result_for_A;
}
}
Almost every single time this function is called, the threads will successfully "find" the stored value for their "A" (whatever it is). Every once in a while, when a "new A" is called, a value will have to be calculated and stored.
So where should I put the #pragma omp critical ?
Though easy, it is very inefficient to put a #pragma omp critical around all of this, since each thread will be doing this constantly and it will often be the read-only case.
Is there any way to implement a "one-way" critical, or a "one-way" lock routine? That is, the above operations involving the iterator should only be "locked" when writing to my_map in the else statement. But multiple threads should be able to execute the .find call simultaneously.
I hope I make sense.
Thank you.
According to this link on Stack Overflow inserting into an std::map doesn't invalidate iterators. The same goes for the end() iterator. Here's a supporting link.
Unfortunately, insertion can happen multiple times if you don't use a critical section. Also, since your calculate_value routine might be computationally expensive, you will have to lock to avoid this else clause being operated on twice with the same value of A and then inserted twice.
Here's a sample function where you can replicate this incorrect multiple insertion:
void testFunc(std::map<int,float> &theMap, int i)
{
std::map<int,float>::iterator ite = theMap.find(i);
if(ite == theMap.end())
{
theMap[i] = 3.14 * i * i;
}
}
Then called like this:
std::map<int,float> myMap;
int i;
#pragma omp parallel for
for(i=1;i<=100000;++i)
{
testFunc(myMap,i % 100);
}
if(myMap.size() != 100)
{
std::cout << "Problem!" << std::endl;
}
Edit: edited to correct an error in an earlier version.
OpenMP is a compiler "tool" for automatic loop parallelization, not a thread communication or synchronization library; so it doesn't have sophisticated mutexes, like a read/write mutex: acquire lock on writing, but not on reading.
Here's an implementation example.
Anyway Chris A.'s answer is better than mine though :)
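For completeness, the read/write lock idea can be sketched with the standard library instead of OpenMP. This assumes C++14 (std::shared_timed_mutex; std::shared_mutex does the same job in C++17) and that mixing standard mutexes with OpenMP threads works on your platform, which is common in practice but not something the OpenMP specification promises. calc_calls and the body of calculate_value are placeholders added for illustration.

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <shared_mutex>

std::atomic<int> calc_calls{0};  // only here to show how often we compute

// Hypothetical stand-in for the expensive calculation.
float calculate_value(int a) {
    ++calc_calls;
    return 3.14f * a * a;
}

float get_or_calculate(int a) {
    static std::map<int, float> cache;
    static std::shared_timed_mutex mtx;
    {
        // Many readers may hold the shared lock at the same time.
        std::shared_lock<std::shared_timed_mutex> read_lock(mtx);
        auto it = cache.find(a);
        if (it != cache.end()) return it->second;
    }
    // Writers are exclusive; re-check because another thread may have
    // inserted the value between the two locks.
    std::unique_lock<std::shared_timed_mutex> write_lock(mtx);
    auto it = cache.find(a);
    if (it == cache.end()) {
        it = cache.emplace(a, calculate_value(a)).first;
    }
    return it->second;
}
```

The calculation runs while the exclusive lock is held, which guarantees each value is computed once but blocks readers for that long; whether that trade-off is acceptable depends on how rare new values really are.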
While @ChrisA's answer may solve your problem, I'll leave my answer here in case any future searchers find it useful.
If you'd like, you can give #pragma omp critical sections a name. Then, any section with that name is considered the same critical section. If this is what you would like to do, you can easily cause only small portions of your method to be critical.
#pragma omp critical (map_protect)
{
std::map<int, float>::iterator it_find = my_map.find(A); //many threads do this often.
bool found_A = it_find != my_map.end();
}
...
#pragma omp critical (map_protect)
{
float result_for_A = calculate_value(A); //should only be done once, really.
my_map[A] = result_for_A;
}
The #pragma omp atomic and #pragma omp flush directives may also be useful.
atomic causes a write to a memory location (lvalue in the expression preceded by the directive) to always be atomic.
flush makes a thread's view of memory consistent with main memory, so that a value written by one thread is actually visible to the others instead of lingering in a register or processor cache.
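As a small illustration of atomic, here is a sketch of a shared histogram where several threads may increment the same bin. The function name histogram is made up for the example, and it assumes all input values are non-negative.

```cpp
#include <cassert>
#include <vector>

// Counts how many inputs fall into each bucket. Several threads may hit the
// same bucket at once, so each increment is made atomic.
std::vector<int> histogram(const std::vector<int>& values, int buckets) {
    assert(buckets > 0);
    std::vector<int> counts(buckets, 0);
    int* bins = counts.data();
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const int b = values[i] % buckets;  // assumes values[i] >= 0
        #pragma omp atomic
        bins[b] += 1;
    }
    return counts;
}
```

atomic only protects that single update of a scalar; it cannot protect a compound operation such as a find followed by an insert into a std::map.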

OpenMP: Causes for heap corruption, anyone?

EDIT: I can run the same program twice, simultaneously without any problem - how can I duplicate this with OpenMP or with some other method?
This is the basic framework of the problem.
//Defined elsewhere
class SomeClass
{
public:
void Function()
{
// Allocate some memory
float *Data;
Data = new float[1024];
// Declare a struct which will be used by functions defined in the DLL
SomeStruct Obj;
Obj = MemAllocFunctionInDLL(Obj);
// Call it
FunctionDefinedInDLL(Data,Obj);
// Clean up
MemDeallocFunctionInDLL(Obj);
delete [] Data;
}
}
void Bar()
{
#pragma omp parallel for
for(int j = 0;j<10;++j)
{
SomeClass X;
X.Function();
}
}
I've verified that when some memory is attempted to be deallocated through MemDeallocFunctionInDLL(), the _CrtIsValidHeapPointer() assertion fails.
Is this because both threads are writing to the same memory?
So to fix this, I thought I'd make SomeClass private (this is totally alien to me, so any help is appreciated).
void Bar()
{
SomeClass X;
#pragma omp parallel for default(shared) private(X)
for(int j = 0;j<10;++j)
{
X.Function();
}
}
And now it fails when it tries to allocate memory in the beginning for Data.
Note: I can make changes to the DLL if required
Note: It runs perfectly without #pragma omp parallel for
EDIT: Now Bar looks like this:
void Bar()
{
int j;
#pragma omp parallel for default(none) private(j)
for(j = 0;j<10;++j)
{
SomeClass X;
X.Function();
}
}
Still no luck.
Check whether MemAllocFunctionInDLL, FunctionDefinedInDLL, and MemDeallocFunctionInDLL are thread-safe or re-entrant. In other words, do these functions use static or shared variables? If so, you need to make sure these variables are not corrupted by other threads.
The fact that it runs fine without omp for could mean that some of these functions were not written to be thread-safe.
I'd like to see what kind of memory allocation/free functions are used in Mem(Alloc|Dealloc)FunctionInDLL.
Added: I'm pretty sure the functions in your DLL are not thread-safe. You say you can run this program twice concurrently without problems. Yes, that should be okay unless your program uses system-wide shared resources (such as global memory or memory shared among processes), which is very rare. In that case nothing is shared between the two processes, so your program works fine.
But invoking these functions from multiple threads (that is, within a single process) crashes your program. This means some variables are shared among the threads and may have been corrupted.
It's not a problem with OpenMP, just a multithreading bug. It could be simple to solve. Please check whether the DLL functions are safe to call concurrently from many threads.
How to privatize static variables
Say that we have such global variables:
static int g_data;
static int* g_vector = new int[100];
Privatization is nothing but creating a private copy for each thread.
int g_data[num_threads];
int* g_vector[num_threads];
for (int i = 0; i < num_threads; ++i)
g_vector[i] = new int[100];
Then any reference to such a variable becomes
// Thread: tid
g_data[tid] = ...
.. = g_vector[tid][..]
Yes, it's pretty simple. However, this sort of code may have a false sharing problem. But false sharing is a matter of performance, not correctness.
First, just try to privatize any static and global variables. Then check its correctness. Next, see what speedup you get. If the speedup scales (say 3.7x faster on a quad core), then it's okay. But if the speedup is low (such as 2x on a quad core), then you are probably looking at a false sharing problem. To solve false sharing, all you need to do is put some padding in the data structures.
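The padding can be as simple as the sketch below. The 64-byte cache line size is an assumption (check your CPU), and PaddedCounter and count_even are names invented for the example. The _OPENMP guards only keep the code compilable when OpenMP is switched off.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// One counter per thread, padded so that neighbouring counters do not
// share a cache line.
struct PaddedCounter {
    long value = 0;
    char pad[kCacheLine - sizeof(long)];
};

long count_even(const std::vector<int>& data) {
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<PaddedCounter> per_thread(max_threads);
    const long n = static_cast<long>(data.size());
    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            if (data[i] % 2 == 0) ++per_thread[tid].value;
        }
    }
    long total = 0;
    for (const PaddedCounter& c : per_thread) total += c.value;
    return total;
}
```

For a plain sum like this one a reduction clause would be simpler; the per-thread array is shown only because it mirrors the privatization scheme described above.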
Instead of
delete Data
you must write
delete [] Data;
Wherever you do new [], make sure to use delete [].
It looks like your problem is not specific to OpenMP. Did you try to run your application without the #pragma omp parallel?
default(shared) means all variables are shared between threads, which is not what you want. Change that to default(none).
private(X) will give each thread its own X; however, these copies are not initialised from the original X (a class object is only default-constructed), so any state set up before the loop is not carried over.
I think you'd be better with your initial approach, put a breakpoint in the Dealloc call, and see what the memory pointer is and what it contains. You can see the guard bytes to tell if the memory has been overwritten at the end of a single call, or after a thread.
Incidentally, I am assuming this works if you run it once, without the omp loop?