I'm trying to experiment a little with OpenMP threads. To keep the "main" function cleaner, I want to use OpenMP threads inside a function that is called by main.
Here is an example:
void func();

int main()
{
    func();
}

void func()
{
    #pragma omp parallel for
    for (int i = 0; i < 5; i++)
    {
        for (int j = 0; j < 5; j++)
        {
            doSomething();
        }
    }
}
With complex computations, when running, the function seems to return after thread 0 finishes, while the other threads haven't finished yet. How can I delay the return until all threads have finished?
Using a barrier inside the for loop is not allowed, so I'm out of ideas.
My question pertains to nested parallelism and OpenMP. Let's start with the following single threaded code snippet:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Now let's say we want to make our calls to performAnotherTask in parallel utilizing OpenMP.
So we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
My understanding is that the calls to performAnotherTask will be performed in parallel, and by default OpenMP will try to use all available threads on the machine (perhaps this assumption is incorrect).
Let's say we now also want to parallelize the calls to performTask such that we get the following code:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
How will this work? Will both for loops still be multithreaded? Can we say anything about the number of threads each loop will use? Is there a way to force the inner for loop (within performTask) to use only a single thread while the outer for loop uses all available threads?
In your last example, the execution behavior depends on a few environment settings.
First, OpenMP does indeed support such patterns, but parallel execution of nested parallel regions is disabled by default. To enable it, you must set OMP_NESTED=true in the environment or call omp_set_nested(1) in your code. (In OpenMP 5.0 and later, these are superseded by OMP_MAX_ACTIVE_LEVELS and omp_set_max_active_levels().)
#include <omp.h>

void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Second, when OpenMP reaches the outer parallel region, it might grab all the available cores and assume that it can execute a thread on each of them, so you might want to reduce the number of threads for the outer level so that some cores remain available for the nested regions. Say, if you have 32 cores, you could do this:
#include <omp.h>

void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp parallel for num_threads(8)
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
omp_set_nested(1);
#pragma omp parallel for num_threads(4)
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
The outer parallel region will execute using 4 threads, each of which will execute the inner region with 8 threads. Note that each of the 4 outer threads will be the master thread of one of the four concurrently executing nested parallel regions. If you want to be more flexible, you can inject the number of threads to use at each level through the environment variable OMP_NUM_THREADS: setting OMP_NUM_THREADS=4,8 gives you the same behavior as the code snippet above.
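To check which configuration is actually in effect, a small probe program can print the nesting level and the thread numbers at each level. This is a minimal sketch of my own (not part of the code above); it assumes nested parallel execution has been enabled as described:
#include <stdio.h>
#include <omp.h>

int main() {
    // Expects to be run with nested parallelism enabled, e.g.
    //   OMP_NESTED=true OMP_NUM_THREADS=4,8 ./probe
    #pragma omp parallel
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel
        {
            // omp_get_level() reports the nesting depth (2 in the inner region).
            #pragma omp critical
            printf("level %d: outer thread %d, inner thread %d of %d\n",
                   omp_get_level(), outer,
                   omp_get_thread_num(), omp_get_num_threads());
        }
    }
    return 0;
}
With OMP_NUM_THREADS=4,8 this should print 32 lines, 8 for each of the 4 outer threads; if nesting is disabled, every inner region still runs, but with only 1 thread.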
The problem with this coding pattern is that you need to be careful to balance each level so that you neither overload the system nor create load imbalances between the nested parallel regions. An alternative solution is to use OpenMP tasks instead:
void performAnotherTask() {
// DO something here
}
void performTask() {
// Do other stuff here
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performAnotherTask();
}
}
int main() {
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (size_t i = 0; i < 100; ++i) {
performTask();
}
return 0;
}
Here each of the taskloop constructs will generate OpenMP tasks that are scheduled to execute on the threads created by the single parallel region in the code. The caveat is that tasks are inherently dynamic in their behavior, so you might lose locality properties, as you do not know where exactly the tasks will be executing in the system.
I have a list of jobs, which I am processing in parallel with OpenMP:
void processAllJobs()
{
#pragma omp parallel for
for(int i = 0; i < n; ++i)
processJob(i);
}
All jobs have some sequential parts and parts that could be parallelized if called alone:
void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
        #pragma omp parallel for
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}
When I run processAllJobs(), threads are created for the outer loop (over the jobs), and the inner loop (over the subtasks) is executed sequentially within each thread. This is all fine and intended.
Sometimes there are very large jobs that take a long time to process, long enough that all the other threads in the outer loop finish well before the last thread and then sit idle. Is there a way to re-purpose the idle threads to parallelize the inner loop as soon as they are finished? I imagine something that checks the number of unused threads each time the inner parallel region is entered.
I cannot predict how long a job runs. It might not be only one long-running job - there might be two or three.
From your description of the problem, OpenMP tasking sounds like a much better choice. Your code would then look like this:
void processAllJobs()
{
    // Note: the combined "parallel master" construct requires OpenMP 5.0;
    // with older compilers, use "#pragma omp parallel" followed by "#pragma omp master".
    #pragma omp parallel master
    for(int i = 0; i < n; ++i)
    {
        #pragma omp task
        processJob(i);
    }
}
Then the processing of the job would look like this:
void processJob(int i)
{
    for(int iteration = 0; iteration < iterationCount; ++iteration)
    {
        doSomePreparation(i);
        std::vector<Subtask> subtasks = getSubtasks(i);
        #pragma omp taskloop // add a grainsize() clause if Process() is very short
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();
        doSomePostProcessing(i);
    }
}
That way you get natural load balancing (assuming that you have enough tasks) without having to rely on nested parallelism.
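As hinted in the comment on the taskloop above, if Process() is very short you can bundle several iterations into one task so that the per-task overhead stays small relative to the work. A rough sketch of that clause (the value 32 is an arbitrary choice of mine, not taken from the question):
        #pragma omp taskloop grainsize(32) // each task handles at least 32 subtasks
        for(int j = 0; j < subtasks.size(); ++j)
            subtasks[j].Process();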
Is there a way to span an OpenMP parallel region across multiple functions?
void foo();

void run()
{
omp_set_num_threads(2);
#pragma omp parallel
{
foo();
#pragma omp for
for(int i = 0; i < 10; ++i)
{
//Do stuff here
}
}
}
void foo()
{
#pragma omp for
for(int j = 0; j < 10; ++j)
{
// Have this code be run as a worksharing loop by the OMP threads
// spawned in run
}
}
In this example, I want the threads started in the omp parallel region in the run function to enter foo and run it as a worksharing loop, the same way they would run the for loop in run. Is this what happens by default, or does each thread run the loop independently? How can I test which one happens?
In my actual code, foo and run are member functions of separate classes.
Thanks!
What you describe as your desired behavior is exactly how OpenMP works: a #pragma omp for encountered in a function called from inside a parallel region is a so-called orphaned worksharing construct, and its iterations are shared among the threads of the innermost enclosing parallel region instead of being executed in full by every thread.
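One way to convince yourself is to print which thread executes which iteration. The following is a small test program of my own (free functions instead of your member functions, names chosen for illustration); with the worksharing behaviour described above, each of the 10 iterations in foo is printed exactly once, split between the two threads, whereas if every thread ran the loop on its own you would see each iteration printed twice:
#include <stdio.h>
#include <omp.h>

void foo()
{
    // Orphaned worksharing loop: it binds to the innermost enclosing
    // parallel region, i.e. the one created in run().
    #pragma omp for
    for(int j = 0; j < 10; ++j)
        printf("foo: iteration %d on thread %d\n", j, omp_get_thread_num());
}

void run()
{
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        foo();
    }
}

int main()
{
    run();
    return 0;
}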
I have a function that I need to make sure is run by only one thread at a time whenever it is called. My function is something like this:
int ReadValue(int position)
{
// read data from a file
}
This function may be called from other functions that may at some stage be part of an OpenMP parallel for.
I want to make sure that it is run by only one thread at a time if it is called from parallel code.
How can I do this?
Here's a full example (tested on 8 cores/threads):
#include <omp.h>
#include <stdio.h>

static int counter = 0;

int ReadValue(int position)
{
    int result;
    // read data from a file one thread at a time
    #pragma omp critical
    {
        // critical section code goes here
        ++counter;
        result = counter; // copy the value while still inside the critical section
    }
    return result;
}
int main() {
const int count = 1 << 20;
// loop and raise protected counter
#pragma omp parallel for
for (int i = 0; i < count; ++i) {
ReadValue(i);
}
// if counter was properly protected nothing will be printed.
if (ReadValue(0) != count + 1)
printf("failure!");
return 0;
}
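One small follow-up, which is my own note rather than part of the example above: if your program contains several unrelated critical sections, consider giving this one a name, because all unnamed critical sections share a single global lock and will serialize against each other. The name read_value_file below is just an example:
#include <omp.h>

static int counter = 0;

int ReadValue(int position)
{
    int result;
    // Only threads entering a critical section with the SAME name serialize
    // against each other; critical sections with other names are unaffected.
    #pragma omp critical(read_value_file)
    {
        ++counter;
        result = counter; // copy out while still holding the lock
    }
    return result;
}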
I have a parallel for in a C++ program that has to loop up to some number of iterations. Each iteration computes a possible solution for an algorithm, and I want to exit the loop once I find a valid one (it is ok if a few extra iterations are done). I know the number of iterations should be fixed from the beginning in the parallel for, but since I'm not increasing the number of iterations in the following code, is there any guarantee that threads check the condition before proceeding with their current iteration?
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition)
max_its = t; // valid to make threads exit the for?
}
}
Modifying the loop bound works with most implementations of OpenMP worksharing constructs, but the program is then no longer conforming to OpenMP and there is no guarantee that it works with other compilers.
Since the OP is OK with some extra iterations, OpenMP cancellation is the way to go. OpenMP 4.0 introduced the "cancel" construct exactly for this purpose: it requests termination of the worksharing construct and sends the threads to the end of it. Note that cancellation must be activated at runtime by setting the environment variable OMP_CANCELLATION=true; otherwise the cancel construct has no effect.
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
...
if(some condition) {
#pragma omp cancel for
}
#pragma omp cancellation point for
}
}
Be aware that there might be a price to pay in terms of performance (the cancellation checks are not free), but you might want to accept this if the overall performance is better when the loop can be aborted early.
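Since it is easy to forget the environment variable, a quick runtime check can confirm that cancellation is actually active. This is my own addition, not part of the original answer:
#include <stdio.h>
#include <omp.h>

int main() {
    // omp_get_cancellation() returns nonzero only if the program was started
    // with OMP_CANCELLATION=true; otherwise "#pragma omp cancel" is ignored.
    if (!omp_get_cancellation())
        printf("warning: OpenMP cancellation is not enabled\n");
    return 0;
}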
In pre-4.0 implementations of OpenMP, the only OpenMP-compliant solution would be to have an if statement that approaches the regular end of the loop as quickly as possible without executing the actual loop body:
void fun()
{
int max_its = 100;
#pragma omp parallel for schedule(dynamic, 1)
for(int t = 0; t < max_its; ++t)
{
if(!some condition) {
... loop body ...
}
}
}
Hope that helps!
Cheers,
-michael
You can't modify max_its, as the standard says it must be a loop-invariant expression.
What you can do, though, is use a boolean shared variable as a flag:
void fun()
{
    int max_its = 100;
    bool found = false;
    #pragma omp parallel for schedule(dynamic, 1) shared(found)
    for(int t = 0; t < max_its; ++t)
    {
        if( ! found ) {
            ...
        }
        if(some condition) {
            #pragma omp atomic write
            found = true;
        }
    }
}
Logic of this kind may also be implemented with tasks instead of a worksharing construct. A sketch of the code would look something like the following:
void algorithm(int t, bool& found) {
    #pragma omp task shared(found)
    {
        if( !found ) {
            // Do work
            if ( /* condition */ ) {
                #pragma omp atomic write
                found = true;
            }
        }
    } // task
} // function
void fun()
{
int max_its = 100;
bool found = false;
#pragma omp parallel
{
#pragma omp single
{
for(int t = 0; t < max_its; ++t)
{
algorithm(t,found);
}
} // single
} // parallel
}
The idea is that a single thread creates max_its tasks. Each task will be assigned to a waiting thread. If one of the tasks finds a valid solution, all the others will be informed through the shared variable found.
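A small refinement of the sketch, which is my suggestion rather than part of the code above: the generating loop in fun() can also test found, so that no new tasks are created once a solution is known:
// Inside the single construct in fun():
// stop creating new tasks as soon as one of them has set 'found'.
// (Like the reads of 'found' inside the tasks, this is a plain read of the flag.)
for(int t = 0; t < max_its && !found; ++t)
{
    algorithm(t, found);
}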
If some_condition is a logical expression that is "always valid", then you could do:
for(int t = 0; t < max_its && !some_condition; ++t)
That way, it's very clear that !some_condition is required to continue the loop, and there is no need to read the rest of the code to find out that "if some_condition, the loop ends".
Otherwise (for example, if some_condition is the result of some calculation inside the loop and it's complicated to "move" it into the for-loop condition), using break is clearly the right thing to do.