Decreasing number of iterations in OpenMP parallel for - c++

I have a parallel for in a C++ program that has to loop up to some number of iterations. Each iteration computes a possible solution for an algorithm, and I want to exit the loop once I find a valid one (it is ok if a few extra iterations are done). I know the number of iterations should be fixed from the beginning in the parallel for, but since I'm not increasing the number of iterations in the following code, is there any guarantee of that threads check the condition before proceeding with their current iteration?
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition)
max_its = t; // valid to make threads exit the for?
}
}

Modifying the loop counter works for most implementations of OpenMP worksharing constructs, but the program will no longer be conforming to OpenMP and there is no guarantee that the program works with other compilers.
Since the OP is OK with some extra iterations, OpenMP cancellation will be the way to go. OpenMP 4.0 introduced the "cancel" construct exactly for this purpose. It will request termination of the worksharing construct and teleport the threads to the end of it.
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition) {
#pragma omp cancel for
}
#pragma omp cancellation point for
}
}
Be aware that might there might be a price to pay in terms of performance, but you might want to accept this if the overall performance is better when aborting the loop.
In pre-4.0 implementations of OpenMP, the only OpenMP-compliant solution would be to have an if statement to approach the regular end of the loop as quickly as possible without execution the actual loop body:
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
if(!some condition) {
... loop body ...
}
}
}
Hope that helps!
Cheers,
-michael

You can't modify max_its as the standard says it must be a loop invariant expression.
What you can do, though, is using a boolean shared variable as a flag:
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel for schedule(dynamic, 1) shared(found)
for(int t = 0; t < max_its; ++t)
{
if( ! found ) {
...
}
if(some condition) {
#pragma omp atomic
found = true; // valid to make threads exit the for?
}
}
}
A logic of this kind may be also implemented with tasks instead of a work-sharing construct. A sketch of the code would be something like the following:
void algorithm(int t, bool& found) {
#pragma omp task shared(found)
{
if( !found ) {
// Do work
if ( /* conditionc*/ ) {
#pragma omp atomic
found = true
}
}
} // task
} // function
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel
{
#pragma omp single
{
for(int t = 0; t < max_its; ++t)
{
algorithm(t,found);
}
} // single
} // parallel
}
The idea is that a single thread creates max_its tasks. Each task will be assigned to a waiting thread. If some of the tasks find a valid solution, then all the others will be informed by the shared variable found.

If some_condition is a logical expression that is "always valid", then you could do:
for(int t = 0; t < max_its && !some_condition; ++t)
That way, it's very clear that !some_condition is required to continue the loop, and there is no need to read the rest of the code to find out that "if some_condition, loop ends"
Otherwise (for example if some_condition is the result of some calculation inside the loop and it's complicated to "move" the some_condition to the for-loop condition, then using break is clearly the right thing to do.

Related

How can I parallelize DFS using OpenMP?

I am trying to figure it out with OpenMP. I need to parallelize depth-first traversal.
This is the algorithm:
void dfs(int v){
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
dfs(g[v][i]);
}
}
}
I try:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <queue>
#include <sstream>
#include <omp.h>
#include <fstream>
#include <vector>
using namespace std;
vector<int> output;
vector<bool> visited;
vector < vector <int> >g;
int global = 0;
void dfs(int v)
{
printf(" potoki %i",omp_get_thread_num());
//cout<<endl;
visited[v] = true;
/*for(int i =0;i<visited.size();i++){
cout <<visited[i]<< " ";
}*/
//cout<<endl;
//global++;
output.push_back(v);
int i;
//printf(" potoki %i",omp_get_num_threads());
//cout<<endl;
for (i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task shared(visited)
{
#pragma omp critical
{
dfs(g[v][i]);
}
}
}
}
}
main(){
omp_set_num_threads(5);
int length = 1000;
int e = 4;
for (int i = 0; i < length; i++) {
visited.push_back(false);
}
int limit = (length / 2) - 1;
g.resize(length);
for (int x = 0; x < g.size(); x++) {
int p=0;
while(p<e){
int new_e = rand() % length ;
if(new_e!=x){
bool check=false;
for(int c=0;c<g[x].size();c++){
if(g[x][c]==new_e){
check=true;
}
}
if(check==false){
g[x].push_back(new_e);
p++;
}
}
}
}
ofstream fin("input.txt");
for (int i = 0; i < g.size(); i++)
{
for (int j = 0; j < g[i].size(); j++)
{
fin << g[i][j] << " ";
}
fin << endl;
}
fin.close();
/*for (int x = 0; x < g.size(); x++) {
for(int j=0;j<g[x].size();j++){
printf(" %i ", g[x][j]);
}
printf(" \n ");
}*/
double start;
double end;
start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single
{
dfs(0);
}
}
end = omp_get_wtime();
cout << endl;
printf("Work took %f seconds\n", end - start);
cout<<global;
ofstream fout("output.txt");
for(int i=0;i<output.size();i++){
fout<<output[i]<<" ";
}
fout.close();
}
Graph "g" is generated and written to the file input.txt. The result of the program is written to the file output.txt.
But this does not work on any number of threads and is much slower.
I tried to use taskwait but in that case, only one thread works.
A critical section protects a block of code so that no more than one thread can execute it at any given time. Having the recursive call to dfs() inside a critical section means that no two tasks could make that call simultaneously. Moreover, since dfs() is recursive, any top-level task will have to wait for the entire recursion to finish before it could exit the critical section and allow a task in another thread to execute.
You need to sychronise where it will not interfere with the recursive call and only protect the update to shared data that does not provide its own internal synchronisation. This is the original code:
void dfs(int v){
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
dfs(g[v][i]);
}
}
}
A naive but still parallel version would be:
void dfs(int v){
#pragma omp critical
{
visited[v] = true;
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task
dfs(g[v][i]);
}
}
}
}
Here, the code leaves the critical section as soon as the tasks are created. The problem here is that the entire body of dfs() is one critical section, which means that even if there are 1000 recursive calls in parallel, they will execute one after another sequentially and not in parallel. It will even be slower than the sequential version because of the constant cache invalidation and the added OpenMP overhead.
One important note is that OpenMP critical sections, just as regular OpenMP locks, are not re-entrant, so a thread could easily deadlock itself due to encountering the same critical section in a recursive call from inside that same critical section, e.g., if a task gets executed immediately instead of being postponed. It is therefore better to implement a re-entrant critical section using OpenMP nested locks.
The reason for that code being slower than sequential is that it does nothing else except traversing the graph. If it was doing some additional work at each node, e.g., accessing data or computing node-local properties, then this work could be inserted between updating visited and the loop over the unvisited neighbours:
void dfs(int v){
#pragma omp critical
visited[v] = true;
// DO SOME WORK
#pragma omp critical
{
for (int i = 0; i < g[v].size(); ++i) {
if (!visited[g[v][i]]) {
#pragma omp task
dfs(g[v][i]);
}
}
}
}
The parts in the critical sections will still execute sequentially, but the processing represented by // DO SOME WORK will overlap in parallel.
There are tricks to speed things up by reducing the lock contention introduced by having one big lock / critical section. One could, e.g., use a set of OpenMP locks and map the index of visited onto those locks, e.g., using simple modulo arithmetic as described here. It is also possible to stop creating tasks at certain level of recursion and call a sequential version of dfs() instead.
void p_dfs(int v)
{
#pragma omp critical
visited[v] = true;
#pragma omp parallel for
for (int i = 0; i < graph[v].size(); ++i)
{
#pragma omp critical
if (!visited[graph[v][i]])
{
#pragma omp task
p_dfs(graph[v][i]);
}
}
}
OpenMP is good for data-parallel code, when the amount of work is known in advance. Doesn’t work well for graph algorithms like this one.
If the only thing you do is what’s in your code (push elements into a vector), parallelism is going to make it slower. Even if you have many gigabytes of data on your graph, the bottleneck is memory not compute, multiple CPU cores won’t help. Also, if all threads gonna push results to the same vector, you’ll need synchronization. Also, reading memory recently written by another CPU core is expensive on modern processors, even more so than a cache miss.
If you have some substantial CPU-bound work besides just copying integers, look for alternatives to OpenMP. On Windows, I usually use CreateThreadpoolWork and SubmitThreadpoolWork APIs. On iOS and OSX, see grand central dispatch. On Linux see cp_thread_pool_create(3) but unlike the other two I don’t have any hands-on experience with it, just found the docs.
Regardless on the thread pool implementation you gonna use, you’ll then be able to post work to the thread pool dynamically, as you’re traversing the graph. OpenMP also has a thread pool under the hood, but the API is not flexible enough for dynamic parallelism.

OpenMP behavior - Nested Multithreading

My question pertains to nested parallelism and OpenMP. Let's start with the following single threaded code snippet:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Now let's say we want to make our calls to performAnotherTask in parallel utilizing OpenMP.
So we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
My understanding is that the calls to performAnotherTask will be performed in parallel, and by default openMP will try and use all available threads on your machine (perhaps this assumption is incorrect).
Let's say we now also want to parallelize the calls to performTask such that we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
How will this work? Will both the for loops still be multithreaded? Can we say anything on the number of threads each loop will use? Is there a way to enforce the inner for loop (within performTask) to only utilize a single thread while the outer for loop uses all available threads?
In your last example, the execution behavior depends on a few environmental settings.
First, OpenMP indeed does support such patterns, but by default disables parallel execution in a nested parallel region. To enabled it, you must set OMP_NESTED=true or call omp_set_nested(1) in your code. Then the support for nested parallel execution is enabled.
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Second, when OpenMP reaches the outer parallel region, it might grab all the available cores and assume that it can execute a thread on them, so you might want to reduce the number of threads for the outer level, so that some cores are available for in nested region. Say, if you have 32 cores, you could do this:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for num_threads(8)
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for num_threads(4)
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
The outer parallel region will execute using 4 threads, each of which will execute the inner region with 8 threads. Note, each of the 4 outer threads will be one of the master threads of the four concurrently executing nested parallel regions. If you want to be more flexible, you can inject the number of threads to use for each level using the environment variable OMP_NUM_THREADS. If you set it to OMP_NUM_THREADS=4,8 you get the same behavior as the above the first code snippet that I have posted.
The problem with the coding pattern is that you need to be careful in balancing each level to not overload the system or get load imbalances between the nested parallel regions. An alternative solution is to use OpenMP tasks instead:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Here each of the taskloop constructs will generate OpenMP task that are scheduled to execute on the threads that have been created by the single parallel region in the code. Caveate is that tasks are inherently dynamic in their behavior, so you might lose locality properties as you do not know where exactly the tasks will be executing in the system.

openmp critical section inside for loop

I have the following code that updates something inside a for loop, with another for loop coming after it. However, I got the error: "expected a declaration" at the beginning of the second loop. The problem seems to be at the "critical" part, because if I delete it, the error will be gone. I'm fresh new to openMP and I was following an example here: http://www.viva64.com/en/a/0054/#ID0EBUEM (refer to "5. Too many entries to critical sections"). Anybody has any idea what I'm doing wrong here?
Besides, is it true that "If the comparison is performed before the critical section, the critical section will not be entered during all iterations of the loop"?
Another thing is that I actually want to parallelize the two loops at the same time, but since the operations inside the loops are different, I use two thread teams here, hoping that if there are threads that are not needed in the first loop, they can start executing the second loop immediately. Will this work?
double maxValue = 0.0;
#pragma omp parallel for schedule (dynamic) //first loop
for (int i = 0; i < n; i++){
if (some condition satisfied)
{
#pragma omp atomic
count++;
continue;
}
double tmp = getValue(i);
#pragma omp flush(maxValue)
if (tmp > maxValue){
#pragma omp critical(updateMaxValue){
if (tmp > maxValue){
maxValue = tmp;
//update some other variables
...
}
}
}
}
#pragma omp parallel for schedule (dynamic) //second loop
for (int i = 0; i < m; i++){
//some operations...
}
#pragma omp barrier
Sorry that I have so many questions and thanks in advance!
However, I got the error: "expected a declaration" at the beginning of the second loop.
You have a syntax error - an opening brace, if present, must be moved to a new line:
#pragma omp critical(updateMaxValue){
// ~^~
should be changed to:
#pragma omp critical(updateMaxValue)
{
(You don't need it actually, since the if-statement that follows is a structured block).
Another thing is that I actually want to parallelize the two loops at the same time, but since the operations inside the loops are different, I use two thread teams here, hoping that if there are threads that are not needed in the first loop, they can start executing the second loop immediately.
Use a single parallel region, and then a nowait clause on the first for-loop:
#pragma omp parallel
{
#pragma omp for schedule(dynamic) nowait
// ~~~~~^
for (int i = 0; i < n; i++)
{
// ...
}
#pragma omp for schedule(dynamic)
for (int i = 0; i < m; i++)
{
// ...
}
}

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j);operation. This gives me a compile error (might have something to do with the elements being of type Eigen::Vector3d or velocity being a class member). Also, I read atomic operations are very slow compared to having a private variable for each thread and doing a reduction in the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
// these variables are local to each thread
std::vector<Eigen::Vector3d> velocity_local(velocity.size());
std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));
#pragma omp for
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity_local[j] += f(j); // save results from the previous calculations
}
// now each thread can save its results to the global variable
#pragma omp critical
{
for (size_t i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
}
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduce clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As request per comment: The reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int vsize = velocity.size();
#pragma omp single
velocitya.resize(vsize*nthreads);
std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
Eigen::Vector3d(0,0,0));
#pragma omp for schedule(static)
for (size_t i = 0; i < clusters.size(); i++) {
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
}
#pragma omp for schedule(static)
for(int i=0; i<vsize; i++) {
for(int t=0; t<nthreads; t++) {
velocity[i] += velocitya[vsize*t + i];
}
}
}
This method requires extra care/tuning due to false sharing which I have not done.
As to which method is better you will have to test.

Parallel OpenMP loop with break statement

I know that you cannot have a break statement for an OpenMP loop, but I was wondering if there is any workaround while still the benefiting from parallelism. Basically I have 'for' loop, that loops through the elements of a large vector looking for one element that satisfies a certain condition. However there is only one element that will satisfy the condition so once that is found we can break out of the loop, Thanks in advance
for(int i = 0; i <= 100000; ++i)
{
if(element[i] ...)
{
....
break;
}
}
See this snippet:
volatile bool flag=false;
#pragma omp parallel for shared(flag)
for(int i=0; i<=100000; ++i)
{
if(flag) continue;
if(element[i] ...)
{
...
flag=true;
}
}
This situation is more suitable for pthread.
You could try to manually do what the openmp for loop does, using a while loop:
const int N = 100000;
std::atomic<bool> go(true);
uint give = 0;
#pragma omp parallel
{
uint i, stop;
#pragma omp critical
{
i = give;
give += N/omp_get_num_threads();
stop = give;
if(omp_get_thread_num() == omp_get_num_threads()-1)
stop = N;
}
while(i < stop && go)
{
...
if(element[i]...)
{
go = false;
}
i++;
}
}
This way you have to test "go" each cycle, but that should not matter that much. More important is that this would correspond to a "static" omp for loop, which is only useful if you can expect all iterations to take a similar amount of time. Otherwise, 3 threads may be already finished while one still has halfway to got...
I would probably do (copied a bit from yyfn)
volatile bool flag=false;
for(int j=0; j<=100 && !flag; ++j) {
int base = 1000*j;
#pragma omp parallel for shared(flag)
for(int i = 0; i <= 1000; ++i)
{
if(flag) continue;
if(element[i+base] ...)
{
....
flag=true;
}
}
}
Here is a simpler version of the accepted answer.
int ielement = -1;
#pragma omp parallel
{
int i = omp_get_thread_num()*n/omp_get_num_threads();
int stop = (omp_get_thread_num()+1)*n/omp_get_num_threads();
for(;i <stop && ielement<0; ++i){
if(element[i]) {
ielement = i;
}
}
}
bool foundCondition = false;
#pragma omp parallel for
for(int i = 0; i <= 100000; i++)
{
// We can't break out of a parallel for loop, so this is the next best thing.
if (foundCondition == false && satisfiesComplicatedCondition(element[i]))
{
// This is definitely needed if more than one element could satisfy the
// condition and you are looking for the first one. Probably still a
// good idea even if there can only be one.
#pragma omp critical
{
// do something, store element[i], or whatever you need to do here
....
foundCondition = true;
}
}
}