OpenMP behavior - Nested Multithreading - c++

My question pertains to nested parallelism and OpenMP. Let's start with the following single threaded code snippet:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Now let's say we want to make our calls to performAnotherTask in parallel utilizing OpenMP.
So we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
My understanding is that the calls to performAnotherTask will be performed in parallel, and by default openMP will try and use all available threads on your machine (perhaps this assumption is incorrect).
Let's say we now also want to parallelize the calls to performTask such that we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
How will this work? Will both the for loops still be multithreaded? Can we say anything on the number of threads each loop will use? Is there a way to enforce the inner for loop (within performTask) to only utilize a single thread while the outer for loop uses all available threads?

In your last example, the execution behavior depends on a few environmental settings.
First, OpenMP indeed does support such patterns, but by default disables parallel execution in a nested parallel region. To enabled it, you must set OMP_NESTED=true or call omp_set_nested(1) in your code. Then the support for nested parallel execution is enabled.
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Second, when OpenMP reaches the outer parallel region, it might grab all the available cores and assume that it can execute a thread on them, so you might want to reduce the number of threads for the outer level, so that some cores are available for in nested region. Say, if you have 32 cores, you could do this:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for num_threads(8)
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for num_threads(4)
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
The outer parallel region will execute using 4 threads, each of which will execute the inner region with 8 threads. Note, each of the 4 outer threads will be one of the master threads of the four concurrently executing nested parallel regions. If you want to be more flexible, you can inject the number of threads to use for each level using the environment variable OMP_NUM_THREADS. If you set it to OMP_NUM_THREADS=4,8 you get the same behavior as the above the first code snippet that I have posted.
The problem with the coding pattern is that you need to be careful in balancing each level to not overload the system or get load imbalances between the nested parallel regions. An alternative solution is to use OpenMP tasks instead:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Here each of the taskloop constructs will generate OpenMP task that are scheduled to execute on the threads that have been created by the single parallel region in the code. Caveate is that tasks are inherently dynamic in their behavior, so you might lose locality properties as you do not know where exactly the tasks will be executing in the system.

Related

using omp threads inside function

I'm trying to play a little with omp threads. Inorder to make "main" function more clean, I want to use omp threads inside function which called by the main function.
Here we have an example:
void main()
{
func();
}
void func()
{
#pragma omp parallel for
for (int i = 0; i < 5; i++)
{
for (int j = 0; j < 5; j++)
{
doSomething();
}
}
}
With complex computations, when running, after thread 0 finishes, the funcion returns while the other threads haven't finish yet. How can I suspend the return until all threads finishes?
Using barrier inside for loop is impossible, so I don't have another idea.

atomic inside a single construct

In an openMP framework, suppose I have a series of tasks that should be done by a single task. Each task is different, so I cannot fit into a #pragma omp for construct. Inside the single construct, each task updates a variable shared by all tasks. How can I protect the update of such a variable?
A simplified example:
#include <vector>
struct A {
std::vector<double> x, y, z;
};
int main()
{
A r;
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i);
// DANGER
r.x = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i);
// DANGER
r.y = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i + 2);
// DANGER
r.z = std::move(res);
}
#pragma omp barrier
return 0;
}
The code lines below // DANGER are problematic because they modify the memory contents of a shared variable.
In the example above, it might be that it still works without issues, because I am effectively modifying different members of r. Still the problem is: how can I make sure that tasks do not simultaineusly update r? Is there a "sort-of" atomic pragma for the single construct?
There is no data race in your original code, because x,y, and z are different vectors in struct A (as already emphasized by #463035818_is_not_a_number), so in this respect you do not have to change anything in your code.
However, a #pragma omp parallel directive is missing in your code, so at the moment it is a serial program. So, it should look like this:
#pragma omp parallel num_threads(3)
{
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i);
// DANGER
r.x = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i);
// DANGER
r.y = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i + 2);
// DANGER
r.z = std::move(res);
}
}
In this case #pragma omp barrier is not necessary as there is an implied barrier at the end of parallel region. Note that I have used num_threads(3) clause to make sure that only 3 threads are assigned to this parallel region. If you skip this clause then all other threads just wait at the barrier.
In the case of an actual data race (i.e. more than one single region/section changes the same struct member), you can use #pragma omp critical (name) to rectify this. But keep in mind that this kind of serialization can negate the benefits of multithreading when there is not enough real parallel work beside the critical section.
Note that, a much better solution is to use #pragma omp sections (as suggested by #PaulG). If the number of tasks to run parallel is known at compile time sections are the typical choice in OpenMP:
#pragma omp parallel sections
{
#pragma omp section
{
//Task 1 here
}
#pragma omp section
{
//Task 2
}
#pragma omp section
{
// Task 3
}
}
For the record, I would like to show that it is easy to do it by #pragma omp for as well:
#pragma omp parallel for
for(int i=0;i<3;i++)
{
if (i==0)
{
// Task 1
} else if (i==1)
{
// Task 2
}
else if (i==2)
{
// Task 3
}
}
each task updates a variable shared by all tasks.
Actually they don't. Consider you rewrite the code like this (you don't need the temporary vectors):
void foo( std::vector<double>& x, std::vector<double>& y, std::vector<double>& z) {
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
x.push_back(i);
}
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
y.push_back(i * i);
}
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
z.push_back(i * i + 2);
}
#pragma omp barrier
}
As long as the caller can ensure that x, y and z do not refer to the same object, there is no data race. Each part of the code modifies a seperate vector. No synchronization needed.
Now, it does not matter where those vectors come from. You can call the function like this:
A r;
foo(r.x, r.y, r.z);
PS: I am not familiar with omp anymore, I assumed the annotations correctly do what you want them to do.

How can I parallelize DFS using OpenMP?

I am trying to figure it out with OpenMP. I need to parallelize depth-first traversal.
This is the algorithm:
void dfs(int v){
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
dfs(g[v][i]);
}
}
}
I try:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <queue>
#include <sstream>
#include <omp.h>
#include <fstream>
#include <vector>
using namespace std;
vector<int> output;
vector<bool> visited;
vector < vector <int> >g;
int global = 0;
void dfs(int v)
{
printf(" potoki %i",omp_get_thread_num());
//cout<<endl;
visited[v] = true;
/*for(int i =0;i<visited.size();i++){
cout <<visited[i]<< " ";
}*/
//cout<<endl;
//global++;
output.push_back(v);
int i;
//printf(" potoki %i",omp_get_num_threads());
//cout<<endl;
for (i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task shared(visited)
{
#pragma omp critical
{
dfs(g[v][i]);
}
}
}
}
}
main(){
omp_set_num_threads(5);
int length = 1000;
int e = 4;
for (int i = 0; i < length; i++) {
visited.push_back(false);
}
int limit = (length / 2) - 1;
g.resize(length);
for (int x = 0; x < g.size(); x++) {
int p=0;
while(p<e){
int new_e = rand() % length ;
if(new_e!=x){
bool check=false;
for(int c=0;c<g[x].size();c++){
if(g[x][c]==new_e){
check=true;
}
}
if(check==false){
g[x].push_back(new_e);
p++;
}
}
}
}
ofstream fin("input.txt");
for (int i = 0; i < g.size(); i++)
{
for (int j = 0; j < g[i].size(); j++)
{
fin << g[i][j] << " ";
}
fin << endl;
}
fin.close();
/*for (int x = 0; x < g.size(); x++) {
for(int j=0;j<g[x].size();j++){
printf(" %i ", g[x][j]);
}
printf(" \n ");
}*/
double start;
double end;
start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
dfs(0);
}
}
end = omp_get_wtime();
cout << endl;
printf("Work took %f seconds\n", end - start);
cout<<global;
ofstream fout("output.txt");
for(int i=0;i<output.size();i++){
fout<<output[i]<<" ";
}
fout.close();
}
Graph "g" is generated and written to the file input.txt. The result of the program is written to the file output.txt.
But this does not work on any number of threads and is much slower.
I tried to use taskwait but in that case, only one thread works.
A critical section protects a block of code so that no more than one thread can execute it at any given time. Having the recursive call to dfs() inside a critical section means that no two tasks could make that call simultaneously. Moreover, since dfs() is recursive, any top-level task will have to wait for the entire recursion to finish before it could exit the critical section and allow a task in another thread to execute.
You need to sychronise where it will not interfere with the recursive call and only protect the update to shared data that does not provide its own internal synchronisation. This is the original code:
void dfs(int v){
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
dfs(g[v][i]);
}
}
}
A naive but still parallel version would be:
void dfs(int v){
#pragma omp critical
{
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task
dfs(g[v][i]);
}
}
}
}
Here, the code leaves the critical section as soon as the tasks are created. The problem here is that the entire body of dfs() is one critical section, which means that even if there are 1000 recursive calls in parallel, they will execute one after another sequentially and not in parallel. It will even be slower than the sequential version because of the constant cache invalidation and the added OpenMP overhead.
One important note is that OpenMP critical sections, just as regular OpenMP locks, are not re-entrant, so a thread could easily deadlock itself due to encountering the same critical section in a recursive call from inside that same critical section, e.g., if a task gets executed immediately instead of being postponed. It is therefore better to implement a re-entrant critical section using OpenMP nested locks.
The reason for that code being slower than sequential is that it does nothing else except traversing the graph. If it was doing some additional work at each node, e.g., accessing data or computing node-local properties, then this work could be inserted between updating visited and the loop over the unvisited neighbours:
void dfs(int v){
#pragma omp critical
visited[v] = true;
// DO SOME WORK
#pragma omp critical
{
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task
dfs(g[v][i]);
}
}
}
}
The parts in the critical sections will still execute sequentially, but the processing represented by // DO SOME WORK will overlap in parallel.
There are tricks to speed things up by reducing the lock contention introduced by having one big lock / critical section. One could, e.g., use a set of OpenMP locks and map the index of visited onto those locks, e.g., using simple modulo arithmetic as described here. It is also possible to stop creating tasks at certain level of recursion and call a sequential version of dfs() instead.
void p_dfs(int v)
{
#pragma omp critical
visited[v] = true;
#pragma omp parallel for
for (int i = 0; i < graph[v].size(); ++i)
{
#pragma omp critical
if (!visited[graph[v][i]])
{
#pragma omp task
p_dfs(graph[v][i]);
}
}
}
OpenMP is good for data-parallel code, when the amount of work is known in advance. Doesn’t work well for graph algorithms like this one.
If the only thing you do is what’s in your code (push elements into a vector), parallelism is going to make it slower. Even if you have many gigabytes of data on your graph, the bottleneck is memory not compute, multiple CPU cores won’t help. Also, if all threads gonna push results to the same vector, you’ll need synchronization. Also, reading memory recently written by another CPU core is expensive on modern processors, even more so than a cache miss.
If you have some substantial CPU-bound work besides just copying integers, look for alternatives to OpenMP. On Windows, I usually use CreateThreadpoolWork and SubmitThreadpoolWork APIs. On iOS and OSX, see grand central dispatch. On Linux see cp_thread_pool_create(3) but unlike the other two I don’t have any hands-on experience with it, just found the docs.
Regardless on the thread pool implementation you gonna use, you’ll then be able to post work to the thread pool dynamically, as you’re traversing the graph. OpenMP also has a thread pool under the hood, but the API is not flexible enough for dynamic parallelism.

Decreasing number of iterations in OpenMP parallel for

I have a parallel for in a C++ program that has to loop up to some number of iterations. Each iteration computes a possible solution for an algorithm, and I want to exit the loop once I find a valid one (it is ok if a few extra iterations are done). I know the number of iterations should be fixed from the beginning in the parallel for, but since I'm not increasing the number of iterations in the following code, is there any guarantee of that threads check the condition before proceeding with their current iteration?
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition)
max_its = t; // valid to make threads exit the for?
}
}
Modifying the loop counter works for most implementations of OpenMP worksharing constructs, but the program will no longer be conforming to OpenMP and there is no guarantee that the program works with other compilers.
Since the OP is OK with some extra iterations, OpenMP cancellation will be the way to go. OpenMP 4.0 introduced the "cancel" construct exactly for this purpose. It will request termination of the worksharing construct and teleport the threads to the end of it.
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition) {
#pragma omp cancel for
}
#pragma omp cancellation point for
}
}
Be aware that might there might be a price to pay in terms of performance, but you might want to accept this if the overall performance is better when aborting the loop.
In pre-4.0 implementations of OpenMP, the only OpenMP-compliant solution would be to have an if statement to approach the regular end of the loop as quickly as possible without execution the actual loop body:
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
if(!some condition) {
... loop body ...
}
}
}
Hope that helps!
Cheers,
-michael
You can't modify max_its as the standard says it must be a loop invariant expression.
What you can do, though, is using a boolean shared variable as a flag:
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel for schedule(dynamic, 1) shared(found)
for(int t = 0; t < max_its; ++t)
{
if( ! found ) {
...
}
if(some condition) {
#pragma omp atomic
found = true; // valid to make threads exit the for?
}
}
}
A logic of this kind may be also implemented with tasks instead of a work-sharing construct. A sketch of the code would be something like the following:
void algorithm(int t, bool& found) {
#pragma omp task shared(found)
{
if( !found ) {
// Do work
if ( /* conditionc*/ ) {
#pragma omp atomic
found = true
}
}
} // task
} // function
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel
{
#pragma omp single
{
for(int t = 0; t < max_its; ++t)
{
algorithm(t,found);
}
} // single
} // parallel
}
The idea is that a single thread creates max_its tasks. Each task will be assigned to a waiting thread. If some of the tasks find a valid solution, then all the others will be informed by the shared variable found.
If some_condition is a logical expression that is "always valid", then you could do:
for(int t = 0; t < max_its && !some_condition; ++t)
That way, it's very clear that !some_condition is required to continue the loop, and there is no need to read the rest of the code to find out that "if some_condition, loop ends"
Otherwise (for example if some_condition is the result of some calculation inside the loop and it's complicated to "move" the some_condition to the for-loop condition, then using break is clearly the right thing to do.

Parallel OpenMP loop with break statement

I know that you cannot have a break statement for an OpenMP loop, but I was wondering if there is any workaround while still the benefiting from parallelism. Basically I have 'for' loop, that loops through the elements of a large vector looking for one element that satisfies a certain condition. However there is only one element that will satisfy the condition so once that is found we can break out of the loop, Thanks in advance
for(int i = 0; i <= 100000; ++i)
{
if(element[i] ...)
{
....
break;
}
}
See this snippet:
volatile bool flag=false;
#pragma omp parallel for shared(flag)
for(int i=0; i<=100000; ++i)
{
if(flag) continue;
if(element[i] ...)
{
...
flag=true;
}
}
This situation is more suitable for pthread.
You could try to manually do what the openmp for loop does, using a while loop:
const int N = 100000;
std::atomic<bool> go(true);
uint give = 0;
#pragma omp parallel
{
uint i, stop;
#pragma omp critical
{
i = give;
give += N/omp_get_num_threads();
stop = give;
if(omp_get_thread_num() == omp_get_num_threads()-1)
stop = N;
}
while(i < stop && go)
{
...
if(element[i]...)
{
go = false;
}
i++;
}
}
This way you have to test "go" each cycle, but that should not matter that much. More important is that this would correspond to a "static" omp for loop, which is only useful if you can expect all iterations to take a similar amount of time. Otherwise, 3 threads may be already finished while one still has halfway to got...
I would probably do (copied a bit from yyfn)
volatile bool flag=false;
for(int j=0; j<=100 && !flag; ++j) {
int base = 1000*j;
#pragma omp parallel for shared(flag)
for(int i = 0; i <= 1000; ++i)
{
if(flag) continue;
if(element[i+base] ...)
{
....
flag=true;
}
}
}
Here is a simpler version of the accepted answer.
int ielement = -1;
#pragma omp parallel
{
int i = omp_get_thread_num()*n/omp_get_num_threads();
int stop = (omp_get_thread_num()+1)*n/omp_get_num_threads();
for(;i <stop && ielement<0; ++i){
if(element[i]) {
ielement = i;
}
}
}
bool foundCondition = false;
#pragma omp parallel for
for(int i = 0; i <= 100000; i++)
{
// We can't break out of a parallel for loop, so this is the next best thing.
if (foundCondition == false && satisfiesComplicatedCondition(element[i]))
{
// This is definitely needed if more than one element could satisfy the
// condition and you are looking for the first one. Probably still a
// good idea even if there can only be one.
#pragma omp critical
{
// do something, store element[i], or whatever you need to do here
....
foundCondition = true;
}
}
}