OpenMP parallelization inside for loops takes too long - c++

I am preparing a program which must use OpenMP parallelization. The program is supposed to compare two frames, inside of which both frames must be compared block by block, and OpenMP must be applied in two ways: one where frame work must be split across threads, and the other where the work must be split between the threads by a block level, finding the minimum cost of each comparison.
The main idea behind the skeleton of the code would look as follows:
int main() {
// code
for () {
for () {
searchBlocks();
}
}
// code
}
searchBlocks() {
for () {
for () {
getCost()
}
}
}
getCost() {
for () {
for () {
// operations
}
}
}
Then, considering parallelization at a frame level, I can simply change the main nested loop to this
int main() {
// code
omp_set_num_threads(threadNo);
#pragma omp parallel for collapse(2) if (isFrame)
for () {
for () {
searchBlocks();
}
}
// code
}
Where threadNo is specified upon running and isFrame is obtained via a parameter to specify if frame level parallelization is needed. This works and the execution time of the program becomes shorter as the number of threads used becomes bigger. However, as I try block level parallelization, I attempted the following:
getCost() {
#pragma omp parallel for collapse(2) if (isFrame)
for () {
for () {
// operations
}
}
}
I'm doing this in getCost() considering that it is the innermost function where the comparison of each corresponding block happens, but as I do this the program takes really long to execute, so much so that if I were to run it without OpenMP support (so 1 single thread) against OpenMP support with 10 threads, the former would finish first.
Is there something that I'm not declaring right here? I'm setting the number of threads right before the nested loops of the main function, just like I had in frame level parallelization.
Please let me know if I need to explain this better, or what it is I could change in order to manage to run this parallelization successfully, and thanks to anyone who may provide help.

Remember that every time your program executes #pragma omp parallel directive, it spawns new threads. Spawning threads is very costly, and since you do getCost() many many times, and each call is not that computationally heavy, you end up wasting all the time on spawning and joining threads (which is essentially making costly system calls).
On the other hand, when #pragma omp for directive is executed, it doesn't spawn any threads, but it lets all the existing threads (which are spawned by previous parallel directive) to execute in parallel on separate pieces of data.
So what you want is to spawn threads on the top level of your computation by doing: (notice no for)
int main() {
// code
omp_set_num_threads(threadNo);
#pragma omp parallel
for () {
for () {
searchBlocks();
}
}
// code
}
and then later to split loops by doing (notice no parallel)
getCost() {
#pragma omp for collapse(2) if (isFrame)
for () {
for () {
// operations
}
}
}

You get cascading parallelization. Take the limit values in the main cycles as I,J, and in the getcost cycles as K,L: you get I * J * K * L threads. Here any operating system will go crazy. So not long before fork bomb to reach...
Well, and "collapse" is also not clear why. It's still two cycles inside, and I don't see much point in this parameter.
Try removing parallelism in Getcost.

Related

Sampling conditional distribution OpenMP

I have a function which draws a random sample:
Sample sample();
I have a function which checks weather a sample is valid:
bool is_valid(Sample s);
This simulates a conditional distribution. Now I want a lot of valid samples (most samples will not be valid).
So I want to parallelize this code with openMP
vector<Sample> valid_samples;
while(valid_samples.size() < n) {
Sample s = sample();
if(is_valid(s)) {
valid_samples.push_back(s);
}
}
How would I do this? Most of the OpenMP code I found were simple for loops where the number of iterations is determined in the beginning.
The sample() function has a
thread_local std::mt19937_64 gen([](){
std::random_device d;
std::uniform_int_distribution<int> dist(0,10000);
return dist(d);
}());
as a random number generator. Is is valid and thread save if I assume that my device has a source of randomness? Are there better solutions?
You may employ OpenMP task parallelism. The simplest solution would be to define a task as a single sample insertion:
vector<Sample> valid_samples(n); // need to be resized to allow access in parallel
void insert_ith(size_t i)
{
do {
valid_samples[i] = sample();
} while (!is_valid(valid_samples[i]));
}
#pragma omp parallel
{
#pragma omp single
{
for (size_t i = 0; i < n; i++)
{
#pragma omp task
insert_ith(i);
}
}
}
Note that there might be performance issues with such single-task-single-insertion mapping. First, there would be false sharing involved, but likely worse, tasking management has some overhead which might be significant for very small tasks. In such a case, a remedy is simple — instead of a single insertion per tasks, insert multiple items at once, such as 100. Usually, a suitable number is a trade-off: the lower creates more tasks = more overhead, the higher may result in worse load balancing.
You need to take care of the critical section in your code, which is the insertion into the answer vector
something like this should work (haven't compiled because functions and types are not given)
// create vector before parallel section because it shall be shared
vector<Sample> valid_samples(n); // set initial size to avoid reallocation
int reached_count = 0;
#pragma omp parallel shared(valid_samples, n, reached_count)
{
while(reached_count < n) { // changed this, see comments for discussion
Sample s = sample(); // I assume this to be thread indepent
if(is_valid(s)) {
#pragma omp critical
{
// check condition again, another thread might have already
// reached maximum number
if(reached_count < n) {
valid_samples.push_back(s);
reached_count = valid_samples.size();
}
}
}
}
}
note that neither sample() nor isvalid(s) are inside of the critical section because I assume these functions to be far more expensive than the vector insertion or the acceptance is very rare
If that is not the case, you could work with indepent local vectors and merge in the end, but that would only gain a significant benefit if you reduce the number of synchronization in some way, like giving a fixed number of iterations (at least for a large part)

How can I write codes that reuse thread on C++ openmp?

I have a function f that I can use parallel processing. For this purpose, I used openmp.
However, this function is called many times, and it seems that thread creation is done every call.
How can we reuse the thread?
void f(X &src, Y &dest) {
... // do processing based on "src"
#pragma omp parallel for
for (...) {
}
...// put output into "dest"
}
int main() {
...
for(...) { // It is impossible to make this loop processing parallel one.
f(...);
}
...
return 0;
}
OpenMP implements thread pool internally, it tries to reuse threads unless you change some of its settings in between or use different application threads to call parallel regions while others are still active.
One can verify that the threads are indeed the same by using thread locals. I'd recommend you to verify your claim about recreating the threads. OpenMP runtime does lots of smart optimizations beyond obvious thread pool idea, you just need to know how to tune and control it properly.
While it is unlikely that threads are recreated, it is easy to see how threads can go to sleep by the time when you call parallel region again and it takes noticeable amount of time to wake them up. You can prevent threads from going to sleep by using OMP_WAIT_POLICY=active and/or implementation-specific environment variables like KMP_BLOCKTIME=infinite (for Intel/LLVM run-times).
This is just in addition to Anton's correct answer. If you are really concerned about the issue, for most programs you can easily move the parallel region on the outside and keep serial work serial like follows:
void f(X &src, Y &dest) {
// You can also do simple computations
// without side effects outside of the single section
#pragma omp single
{
... // do processing based on "src"
}
#pragma omp for // note parallel is missing
for (...) {
}
#pragma omp critical
...// each thread puts its own part of the output into "dest"
}
int main() {
...
// make sure to declare loop variable locally or explicitly private
#pragma omp parallel
for(type variable;...;...) {
f(...);
}
...
return 0;
}
Use this only if you have measured evidence that you are suffering from the overhead of reopening parallel regions. You may have to juggle with shared variables, or manually inline f, because all variables declared within f will be private - so how it looks in detail depends on your specific application.

How to run two set of code in parallel using openmp in c++

I have two function which are not related to each other for example:
int add(int num)
{
int sum=0;
for(i=0;i<num;++i)
sum+=i;
return sum;
}
int mul(int num)
{
int mul=1;
for(int i=1;i<num;++i)
mul * i;
return mul;
}
and I am suing them as follow:
auto x=add(100);
auto m=mul(200);
cout<<a<< " " <<m<<endl;
How can I run them in parallel using OpenMP? I know that I can run them in parallel if I create a new thread and run one of the functions in that thread and implement a sync mechanizim to make sure that both threads finisshes by time that cout is called.
Also I know that I can use openMP parallel for for my loops, but assume that it is not there.
The usual way to sort out this problem in OpenMP is the sections construct. It enables you to define parts of your sequential code, that can be computed concurrently by different threads. Each section starts with the omp section directive/pragma.

Forking once and then use the threads in a different procedure?

If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?
My guess/hope is that existing threads are used, as they all should have nothing to do at the moment.
Pseudo-Example:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
#pragma omp single
{
for (int t=0; t<1000; t++) {
evolve();
}
}
}
}
void evolve() {
#pragma omp parallel for num_threads(2)
for (int i=0; i<100; i++) {
do_stuff(i);
}
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
As evolve() is called very often, forking here would cause way to much overhead, so I would like to do it only once, then call evolve() from a single thread and split the work of the calls to do_stuff() over the existing threads.
For Fortran this seems to work. I get a roughly 80-90% speed increase on a simple example using 2 threads. But for C++ I get a different behavior, only the thread which executes the single directive is used for the loop in evolve()
I fixed the problem using the task directive in the main program and passing the limits to evolve(), but this looks like a clumsy solution...
Why is the behavior in Fortran and C++ different and what would be the solution in C++?
I believe orphaned directives are the cleanest solution in your case:
double A[];
int main() {
#pragma omp parallel num_threads(2)
{
// Each thread calls evolve() a thousand times
for (int t=0; t<1000; t++) {
evolve();
}
}
}
void evolve() {
// The orphaned construct inside evolve()
// will bind to the innermost parallel region
#pragma omp for
for (int i=0; i<100; i++) {
do_stuff(i);
} // Implicit thread synchronization
}
void do_stuff(int i) {
// expensive calculation on array element A[i]
}
This will work because (section 2.6.1 of the standard):
A loop region binds to the innermost enclosing parallel region
That said, in your code you are using nested parallel constructs. To be sure to enable them you must set the environment variable OMP_NESTED to true, otherwise (quoting Appendix E of the latest standard):
OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:
void foo() {
#pragma omp parallel
#pragma omp single
bar();
}
void bar() {
#pragma omp parallel
printf("...)";
}
OpenMP is requested to create a new team of threads when entering the parallel region in bar. OpenMP calls that "nested parallelism". However, what exactly happens depends on the your actual implementation of OpenMP used and the setting of OMP_NESTED.
OpenMP implementations are not required to support nested parallelism. It would be perfectly legal, if an implementation ignored the parallel region in bar and just execute it with one thread. OMP_NESTED can be used to turn on and off nesting, if the implementation supports it.
In your case, things by chance went well, since you sent all threads to sleep except one. This thread then created a new team of threads of full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you would easily get thousands of threads.
Unfortunately, OpenMP does not support your pattern to create a parallel team, have one thread executing the call stacks, and then distribute work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution will be OpenMP tasks.
Cheers,
-michael
Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). However, if that really is what you meant, I suspect that most OpenMP implementations will not work correctly in a forked process. Typically, threads are not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly following a fork() call, but it will not use the same threads as the parent process.

c++ OpenMP critical: "one-way" locking?

Consider the following serial function. When I parallelize my code, every thread will call this function from within the parallel region (not shown). I am trying to make this threadsafe and efficient (fast).
float get_stored_value__or__calculate_if_does_not_yet_exist( int A )
{
static std::map<int, float> my_map;
std::map::iterator it_find = my_map.find(A); //many threads do this often.
bool found_A = it_find != my_map.end();
if (found_A)
{
return it_find->second;
}
else
{
float result_for_A = calculate_value(A); //should only be done once, really.
my_map[A] = result_for_A;
return result_for_A;
}
}
Almost every single time this function is called, the threads will successfully "find" the stored value for their "A" (whatever it is). Every once in a while, when a "new A" is called, a value will have to be calculated and stored.
So where should I put the #pragma omp critical ?
Though easy, it is very inefficient to put a #pragma omp critical around all of this, since each thread will be doing this constantly and it will often be the read-only case.
Is there any way to implement a "one-way" critical, or a "one-way" lock routine? That is, the above operations involving the iterator should only be "locked" when writing to my_map in the else statement. But multiple threads should be able to execute the .find call simultaneously.
I hope I make sense.
Thank you.
According to this link on Stack Overflow inserting into an std::map doesn't invalidate iterators. The same goes for the end() iterator. Here's a supporting link.
Unfortunately, insertion can happen multiple times if you don't use a critical section. Also, since your calculate_value routine might be computationally expensive, you will have to lock to avoid this else clause being operated on twice with the same value of A and then inserted twice.
Here's a sample function where you can replicate this incorrect multiple insertion:
void testFunc(std::map<int,float> &theMap, int i)
{
std::map<int,float>::iterator ite = theMap.find(i);
if(ite == theMap.end())
{
theMap[i] = 3.14 * i * i;
}
}
Then called like this:
std::map<int,float> myMap;
int i;
#pragma omp parallel for
for(i=1;i<=100000;++i)
{
testFunc(myMap,i % 100);
}
if(myMap.size() != 100)
{
std::cout << "Problem!" << std::endl;
}
Edit: edited to correct error in earler version.
OpenMP is a compiler "tool" for automatic loop parallelization, not a thread communication or synchronization library; so it doesn't have sophisticated mutexes, like a read/write mutex: acquire lock on writing, but not on reading.
Here's an implementation example.
Anyway Chris A.'s answer is better than mine though :)
While #ChrisA's answer may solve your problem, I'll leave my answer here in case any future searchers find it useful.
If you'd like, you can give #pragma omp critical sections a name. Then, any section with that name is considered the same critical section. If this is what you would like to do, you can easily cause onyl small portions of your method to be critical.
#pragma omp critical map_protect
{
std::map::iterator it_find = my_map.find(A); //many threads do this often.
bool found_A = it_find != my_map.end();
}
...
#pragma omp critical map_protect
{
float result_for_A = calculate_value(A); //should only be done once, really.
my_map[A] = result_for_A;
}
The #pragma omp atomic and #pragma omp flush directives may also be useful.
atomic causes a write to a memory location (lvalue in the expression preceded by the directive) to always be atomic.
flush ensures that any memory expected to be available to all threads is actually written to all threads, not stored in a processor cache and unavailable where it should be.