Nested parallel regions in OpenMP and gfortran

I am trying a very simple OpenMP Fortran program because I am having problems with nested parallel regions. The code is as follows:
PROGRAM TEST_OPENMP
  IMPLICIT NONE
  INTEGER :: omp_get_num_threads, omp_get_thread_num
  CALL omp_set_num_threads( 4 )
  WRITE(*,*) "I am the master thread"
  !$OMP PARALLEL
  write(*,*) "Hello, I am in the first parallel region", omp_get_thread_num()
  !$OMP PARALLEL
  write(*,*) "I am a nested parallel region : ", omp_get_thread_num()
  !$OMP END PARALLEL
  !$OMP END PARALLEL
  write(*,*) "The master thread is back: serial region", omp_get_thread_num()
END PROGRAM TEST_OPENMP
I expected to get 4 instances of the string starting with "Hello, I am in..." and 16 of "I am a nested...", but what I get is the following:
I am the master thread
Hello, I am in the first parallel region 3
Hello, I am in the first parallel region 2
Hello, I am in the first parallel region 1
I am a nested parallel region : 0
I am a nested parallel region : 0
Hello, I am in the first parallel region 0
I am a nested parallel region : 0
I am a nested parallel region : 0
The master thread is back: serial region 0
So the nested parallel region is being serialized. I am compiling with "gfortran -fopenmp filename". Am I doing something horribly wrong?

Thanks for the tip, Tim. I was using a tutorial that didn't mention OMP_NESTED. Sorry for the dumb question.
Now it works smoothly:
$ export OMP_NESTED=True
$ gfortran -fopenmp openmp_example_3.f90
$ a.out
I am the master thread
Hello, I am in the first parallel region 0
Hello, I am in the first parallel region 3
I am a nested parallel region : 0
I am a nested parallel region : 2
Hello, I am in the first parallel region 2
I am a nested parallel region : 1
Hello, I am in the first parallel region 1
I am a nested parallel region : 3
I am a nested parallel region : 0
I am a nested parallel region : 1
I am a nested parallel region : 1
I am a nested parallel region : 0
I am a nested parallel region : 1
I am a nested parallel region : 3
I am a nested parallel region : 2
I am a nested parallel region : 3
I am a nested parallel region : 0
I am a nested parallel region : 2
I am a nested parallel region : 3
I am a nested parallel region : 2
The master thread is back: serial region 0
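As an aside, nesting can also be switched on from inside the program rather than through the environment. A minimal sketch (omp_set_nested is deprecated since OpenMP 5.0 in favour of omp_set_max_active_levels, so both calls are shown):
PROGRAM NESTED_ON
  USE omp_lib
  IMPLICIT NONE
  CALL omp_set_nested(.TRUE.)        ! pre-5.0 way to enable nested parallelism
  CALL omp_set_max_active_levels(2)  ! OpenMP 5.0+ replacement: allow 2 active levels
  CALL omp_set_num_threads(4)
  !$OMP PARALLEL
  WRITE(*,*) "outer:", omp_get_thread_num()
  !$OMP PARALLEL
  WRITE(*,*) "inner:", omp_get_thread_num()
  !$OMP END PARALLEL
  !$OMP END PARALLEL
END PROGRAM NESTED_ON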

Related

How to use two nodes for one OpenMP Fortran 90 code in a SLURM cluster?

I am brand new to using SLURM on a cluster.
I am now struggling with OpenMP Fortran 90.
I am trying to calculate an integral using two nodes (node1 and node2) through SLURM.
What I want is to return one value by combining the calculations of node1 and node2 using Fortran OpenMP.
However, when I use "srun" it appears that the two nodes run the same executable independently.
For example, if I run the code below, the job prints two identical values (one from each node). If I execute without "srun" it looks fine at first: "squeue" reports 100 CPUs in use across the two nodes. But if I ssh into each of the two nodes and check, only node1 is actually working and node2 is idle.
Can someone shed some light on this?
----source code----
program integral
  use omp_lib
  implicit none
  integer :: i, n
  real :: x, y1, y2, xs, xe, dx, sum, dsum
  n = 100000000
  xs = 0.
  xe = 3.
  sum = 0.
  dx = (xe-xs)/real(n)
  ! trapezoidal rule; y1 and y2 must be private too, or the threads race on them
  !$omp parallel do default(shared) private(i,dsum,x,y1,y2) reduction(+:sum)
  do i = 1, n
    x = xs + real(i-1)*dx
    y1 = x**2
    y2 = (x+dx)**2
    dsum = (y1+y2)*dx/2
    sum = sum + dsum
  enddo
  !$omp end parallel do
  print *, sum
end program
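As a quick sanity check: the code integrates x**2 from 0 to 3 with the trapezoidal rule, and the exact value is 27/3 = 9, so a correct run should print a value near 9.0 (though with n = 10^8 terms accumulated in single precision, rounding error can be noticeable).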
----job script----
#!/bin/sh
#SBATCH -J test
#SBATCH -p oldbatch
#SBATCH -o test%j.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=50
export OMP_NUM_THREADS=50
srun ./a.out
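For what it's worth: OpenMP is a shared-memory model, so a single OpenMP program cannot span two nodes, and srun with --nodes=2 --ntasks-per-node=1 launches one independent copy of the program on each node, which is exactly the duplicated output observed. The usual approach is an MPI+OpenMP hybrid: MPI distributes the index range across nodes and reduces the partial sums, while OpenMP parallelizes within each node. A minimal, untested sketch (the decomposition logic and MPI setup are illustrative assumptions, not from the original post):
program integral_hybrid
  use mpi
  implicit none
  integer :: i, n, rank, nprocs, ierr, i0, i1
  real :: x, y1, y2, xs, xe, dx, sum, total
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  n = 100000000
  xs = 0.
  xe = 3.
  sum = 0.
  dx = (xe-xs)/real(n)
  ! each MPI rank takes a contiguous block of the index range
  i0 = rank*(n/nprocs) + 1
  i1 = (rank+1)*(n/nprocs)
  if (rank == nprocs-1) i1 = n   ! last rank picks up the remainder
  ! OpenMP parallelizes within each node, as before
  !$omp parallel do private(x,y1,y2) reduction(+:sum)
  do i = i0, i1
    x = xs + real(i-1)*dx
    y1 = x**2
    y2 = (x+dx)**2
    sum = sum + (y1+y2)*dx/2
  enddo
  !$omp end parallel do
  ! combine the per-node partial sums into a single value on rank 0
  call MPI_Reduce(sum, total, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, total
  call MPI_Finalize(ierr)
end program integral_hybrid
This would be compiled with something like mpifort -fopenmp and launched with the same srun line; each of the two tasks then works on half of the range, and rank 0 prints the combined value.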

GNU parallel and resource management

I would like to use the GNU parallel command line tool to act as a simple scheduling mechanism.
In my case, I have N GPUs on a system and I would like to effectively queue a list of jobs onto those GPUs.
Basically, I have a list of inputs and I would naively run
parallel --jobs=4 ./my_script.sh ::: $(cat list_of_things.txt) ::: 0 1 2 3
where ./my_script.sh accepts two args: the thing I want to process, and the GPU I want to process it on.
What I want is for each thing in the list to run on just one of the GPUs (0 through 3).
However, this ends up running each thing 4 times.
Try this:
parallel --jobs=4 ./my_script.sh {%} {} :::: list_of_things.txt
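One caveat: {%} is GNU parallel's job-slot number and counts from 1, so with --jobs=4 the script receives 1 through 4 rather than 0 through 3; if the GPU index must start at 0, subtract 1 inside my_script.sh.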

OpenMP Fortran code showing the same performance as serial

I am a new user of OpenMP. I have written the following code in Fortran and tried to parallelize it using OpenMP. Unfortunately, it is taking the same time as the serial version of this subroutine. I am compiling it with the f2py command below. I am sure I am missing a key concept here but unable to figure it out. I would really appreciate any help with this.
!f2py -c --opt='-O3' --f90flags='-fopenmp' -lgomp -m g3Test g3TestA.f90
exp1 = 0.0
exp2 = 0.0
exp3 = 0.0
! xConfig is written by every thread, so it must be private as well
!$OMP PARALLEL DO shared(s1,s2,s3,c1,c2,c3) private(h,xConfig) &
!$OMP REDUCTION(+:exp1,exp2,exp3)
do k = 0, numRows-1
  xConfig(0:2) = X(k,0:2)
  do h = 0, nPhi-1
    exp1(h) = exp1(h) + exp(-((xConfig(0)-c1(h))**2)*s1)
    exp2(h) = exp2(h) + exp(-((xConfig(1)-c2(h))**2)*s2)
    exp3(h) = exp3(h) + exp(-((xConfig(2)-c3(h))**2)*s3)
  end do
end do
!$OMP END PARALLEL DO
ALine = exp1 + exp2 + exp3
As neatly explained in this OpenMP performance training course material from the University of Edinburgh, there are a number of reasons why OpenMP code does not necessarily scale as you would expect: for example, how much of the serial runtime is taken by the part you are parallelising, synchronisation between threads, communication, and other parallel overheads.
You can easily test the performance with different numbers of threads by calling your Python script as follows, e.g. with 2 threads:
env OMP_NUM_THREADS=2 python <your script name>
and you may consider adding the following lines in your code example to get a visual confirmation of the number of threads being used in the OpenMP part of your code:
do k = 0, numRows-1
  !this if-statement is only for debugging, remove for timing
  !$ if (k == 0) then
  !$   print *, 'num_threads running:', OMP_get_num_threads()
  !$ end if
  xConfig(0:2) = X(k,0:2)
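To time just the OpenMP section rather than the whole Python call, you could also bracket the region with omp_get_wtime(). A self-contained sketch (the harmonic-sum loop is only a stand-in for the real work):
program time_region
  use omp_lib
  implicit none
  integer :: k
  real(8) :: t0, t1, s
  s = 0d0
  t0 = omp_get_wtime()           ! wall-clock time before the region
  !$omp parallel do reduction(+:s)
  do k = 1, 100000000
    s = s + 1d0/real(k,8)
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()           ! wall-clock time after the region
  print *, 'parallel loop took', t1-t0, 's using', omp_get_max_threads(), 'threads'
end program time_region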

Nested OpenMP parallel regions not iterating as expected

This is probably a dumb question, but I'm just starting out with OpenMP due to increased data volumes.
I'm going through "Parallel Programming in Fortran 95 using OpenMP" by Miguel Hermanns and am very early in the book. One of the early examples shows the use of nested parallel regions and indicates that it should produce N^2 + N lines of output. The procedure looks like this:
program helloworld
  !$OMP PARALLEL
  write(*,*) "Hello"
  !$OMP PARALLEL
  write(*,*) "Hi"
  !$OMP END PARALLEL
  !$OMP END PARALLEL
end program helloworld
I would expect 12 Hellos and 144 His, but instead I get 12 of each:
$ ./helloworld.exe
Hello
Hello
Hello
Hi
Hi
Hello
Hello
Hello
Hello
Hello
Hello
Hi
Hi
Hello
Hello
Hi
Hi
Hi
Hi
Hi
Hello
Hi
Hi
Hi
Why am I not getting the 156 lines of output that I would expect?
By default, OpenMP serializes all nested parallel regions in order to prevent the worst case of quadratic over-subscription, where N^2 worker threads are created. With a big enough number of processors (say >= 16), quadratic over-subscription can ruin execution with nightmarish overheads, or cause resource exhaustion when the requested number of threads simply cannot be created.
For information on how to enable nested parallelism in OpenMP at your own risk, please refer to omp_set_nested and the corresponding environment variable OMP_NESTED.
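Applied to the program above, a quick recipe (assuming an OpenMP 3.0+ runtime, which supports the comma-separated per-level form of OMP_NUM_THREADS) would be to enable nesting and give each level its own thread count:
$ export OMP_NESTED=true
$ export OMP_NUM_THREADS=12,12
$ ./helloworld.exe
Each of the 12 outer threads should then spawn its own 12-thread inner team, producing the expected 12 + 144 = 156 lines.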

OpenMP - Starting a new thread in each loop iteration

I'm having trouble adjusting my thinking to suit OpenMP's way of doing things.
Roughly, what I want is:
for (int i = 0; i < 50; i++)
{
    doStuff();
    thread t;
    t.start(callback(i)); // each time around the loop, create a thread to execute callback
}
I think I know how this would be done in C++11, but I need to be able to accomplish something similar with OpenMP.
The closest thing to what you want are OpenMP tasks, available in compilers compliant with OpenMP v3.0 and later. It goes like this:
#pragma omp parallel
{
    #pragma omp single
    for (int i = 0; i < 50; i++)
    {
        doStuff();
        #pragma omp task
        callback(i);
    }
}
This code will make the loop execute in one thread only and it will create 50 OpenMP tasks that will call callback() with different parameters. Then it will wait for all tasks to finish before exiting the parallel region. Tasks will be picked (possibly at random) by idle threads to be executed. OpenMP imposes an implicit barrier at the end of each parallel region since its fork-join execution model mandates that only the main thread runs outside of parallel regions.
Here is a sample program (ompt.cpp):
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

void callback (int i)
{
    printf("[%02d] Task started with thread %d\n", i, omp_get_thread_num());
    sleep(1);
    printf("[%02d] Task finished\n", i);
}

int main (void)
{
    #pragma omp parallel
    {
        #pragma omp single
        for (int i = 0; i < 10; i++)
        {
            #pragma omp task
            callback(i);
            printf("Task %d created\n", i);
        }
    }
    printf("Parallel region ended\n");
    return 0;
}
Compilation and execution:
$ g++ -fopenmp -o ompt.x ompt.cpp
$ OMP_NUM_THREADS=4 ./ompt.x
Task 0 created
Task 1 created
Task 2 created
[01] Task started with thread 3
[02] Task started with thread 2
Task 3 created
Task 4 created
Task 5 created
Task 6 created
Task 7 created
[00] Task started with thread 1
Task 8 created
Task 9 created
[03] Task started with thread 0
[01] Task finished
[02] Task finished
[05] Task started with thread 2
[04] Task started with thread 3
[00] Task finished
[06] Task started with thread 1
[03] Task finished
[07] Task started with thread 0
[05] Task finished
[08] Task started with thread 2
[04] Task finished
[09] Task started with thread 3
[06] Task finished
[07] Task finished
[08] Task finished
[09] Task finished
Parallel region ended
Note that tasks are not executed in the same order they were created in.
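Since the rest of this page is Fortran-centred, it may help to note that the same pattern translates directly to Fortran. A minimal sketch (untested; the names are illustrative):
program tasks_demo
  use omp_lib
  implicit none
  integer :: i
  !$omp parallel
  !$omp single
  do i = 0, 9
    !$omp task firstprivate(i)   ! i is shared here, so capture its value per task
    call callback(i)
    !$omp end task
  end do
  !$omp end single
  !$omp end parallel
contains
  subroutine callback(i)
    integer, intent(in) :: i
    print '(a,i2,a,i2)', 'task ', i, ' ran on thread ', omp_get_thread_num()
  end subroutine callback
end program tasks_demo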
GCC does not support OpenMP 3.0 in versions older than 4.4. Unrecognised OpenMP directives are silently ignored, and the resulting executable runs that code section serially:
$ g++-4.3 -fopenmp -o ompt.x ompt.cpp
$ OMP_NUM_THREADS=4 ./ompt.x
[00] Task started with thread 3
[00] Task finished
Task 0 created
[01] Task started with thread 3
[01] Task finished
Task 1 created
[02] Task started with thread 3
[02] Task finished
Task 2 created
[03] Task started with thread 3
[03] Task finished
Task 3 created
[04] Task started with thread 3
[04] Task finished
Task 4 created
[05] Task started with thread 3
[05] Task finished
Task 5 created
[06] Task started with thread 3
[06] Task finished
Task 6 created
[07] Task started with thread 3
[07] Task finished
Task 7 created
[08] Task started with thread 3
[08] Task finished
Task 8 created
[09] Task started with thread 3
[09] Task finished
Task 9 created
Parallel region ended
For example, have a look at http://en.wikipedia.org/wiki/OpenMP.
#pragma omp for
is your friend. OpenMP does not require you to think about threading yourself. You just declare(!) what you want to be run in parallel, and an OpenMP-compatible compiler performs the needed transformations in your code at compile time.
The OpenMP specifications are also very enlightening. They explain quite well what can be done and how: http://openmp.org/wp/openmp-specifications/
Your sample could look like:
#pragma omp parallel for
for (int i = 0; i < 50; i++)
{
    doStuff();
    callback(i); // each iteration may now run on a different thread
}
Everything in the for loop runs in parallel. You have to pay attention to data dependencies. The doStuff() function runs sequentially in your pseudocode, but would run in parallel in my sample. You also need to specify which variables are thread-private; such clauses also go into the #pragma statement.