Fortran OpenMP: creating threads only once - fortran

I am working in Fortran. My purpose is to parallelize with OPENMP a program of this kind:
do t=1, t_fin (sequential)
a sequence of do loops (not sequential)
end do
where the cycle is sequential (temporal). I have tried to create the threads at the beginning of each iteration of the sequential cycle, but this makes the code slower than it could be. Therefore, my question is whether it is possible to create the threads only once, before starting the cycle. In practice:
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP CRITICAL
do t=1, t_fin
sequence of OMP do
end do
!$OMP END CRITICAL
!$OMP END PARALLEL
I have tried it this way, but it works as if there were only one thread. I suppose this is because the outer CRITICAL section encloses the OMP DO. However, I would like to execute the inner OMP DO with more than one thread. Is there a way to obtain this?

If I understand your question correctly, you want to avoid creating threads in each iteration of the outer loop. This can be achieved by moving the OMP PARALLEL directive outside the loop and leaving the other statements inside it. Then I see two parallelization schemes.
You could either parallelize the inner loops:
!$OMP PARALLEL
do t=1, t_fin (sequential)
!$OMP DO
first loop
!$OMP END DO
!$OMP DO
second loop
!$OMP END DO
!...
end do
!$OMP END PARALLEL
or, use sections to run the loops in parallel (if they are independent of each other):
!$OMP PARALLEL
do t=1, t_fin (sequential)
!$OMP SECTIONS
!$OMP SECTION
first loop
!$OMP SECTION
second loop
!...
!$OMP END SECTIONS
end do
!$OMP END PARALLEL
In both versions, the threads are created outside the loop.

Related

OMP : Is REDUCTION mandatory for an array when in a DO loop there is no concurrency between cells?

I wrote the following program that works as expected.
program main
use omp_lib
implicit none
integer :: i,j
integer, dimension(1:9) :: myArray
myArray=0
do j=0,3
!$OMP PARALLEL DO DEFAULT(SHARED)
do i=1,9
myArray(i) = myArray(i) + i*(10**j)
end do
!$OMP END PARALLEL DO
enddo
print *, myArray
end program main
As only one thread writes to the i-th cell of myArray, REDUCTION on myArray has not been used. But I wonder whether REDUCTION(+ : myArray) must be added to the OMP DO, or whether it is unnecessary. In other terms, what is important: the array or the cell of the array?
What is important : the array or the cell of the array ?
Cells. !$OMP PARALLEL DO DEFAULT(SHARED) is fine as long as the loop is embarrassingly parallel. When threads operate on the same location in memory, that causes a race condition. When a reduction is performed cell-wise, REDUCTION(+ : myArray) can be added. That being said, one should note that the full array will likely be replicated in a temporary copy in each thread before the final reduction of the whole array. Since your loop is embarrassingly parallel, REDUCTION(+ : myArray) is not needed here.
Besides, note that the number of iterations is too small for multiple threads to make the code faster. In fact, it will be slower because of the time spent creating/joining threads, the time needed to distribute the work amongst threads, and also because of an effect called false sharing (which will nearly serialize the execution of this loop).

openmp: seq processing within subroutine

I would like to carry out calculations in parallel by calling
!$OMP PARALLEL
call sub
!$OMP END PARALLEL
this fires off n (= number of threads) calls to sub, as desired.
Within sub, I am reading from separate data files, with the file and the number of reads determined by the thread number. This works if I enclose everything in sub in !$OMP ORDERED ... !$OMP END ORDERED.
However, this also causes all threads to run strictly sequentially, i.e. thread 0 runs first and all reads in it complete before thread 1 starts, etc.
What I would like to achieve, though, is that all n threads run concurrently and only the processing within a thread is sequential (as the reads are from different data files). Any idea how? Replacing ORDERED by TASK does not help.

Openmp: How to collect an array from different threads?

I am an OpenMP newbie and I am stuck on a problem! I have an array that is summed in a loop, but I am having trouble parallelizing this. Can you suggest how to do it? The main loop is sketched as follows:
REAL A (N) ! N is an integer
!$OMP PARALLEL
DO I=1,10 ! just an illustration
DO J=1,10 ! can't post real code sorry, illegal
!$OMP DO
DO K=1,100
call messy_subroutine_that_sums_A(I,J, K, A) ! each thread returns its own sum
! each thread returns its own A right? and needs to summed quickly
END DO
!$OMP END DO
END DO
END DO
SUBROUTINE messy_subroutine_that_sums_A(I,J, K, A)
REAL A(N) ! an array
! basically adds some stuff to the previous value of A
A=A+junk ! this is really junk
END SUBROUTINE messy_subroutine_that_sums_A
My problem is that all my attempts to collect A from all the threads have failed. If you notice, A is summed over the outer loops as well. What is the correct and fast procedure to collect A from all the threads as a sum? Secondly, my question is not just a Fortran question; it applies equally to C and C++. It is a conceptual question.
Actually, OpenMP does support reduction on arrays in C and C++ (announced recently, in the OpenMP 4.1 comment draft release), and of course in Fortran.
Not all implementations may support this feature, so you had better first check whether your compiler supports it. To check that your calculations are correct, you can start by placing A=A+junk into a critical section:
!$omp critical
A=A+junk
!$omp end critical
This should give you the same correct answer in all threads after the OMP DO loop.
Then you can improve performance by using an array reduction on the OMP DO loop instead of the critical section, again getting the same correct answer in all threads after the loop.
Then you can further improve performance by moving the reduction to the OMP PARALLEL, but you won't be able to check the values in all threads in that case, because all threads will work with private copies of the array A and thus hold different partial values. The final correct answer will be available in the master thread only, after the OMP PARALLEL.
Note, you don't need to declare loop iteration variables private in Fortran, as they should be automatically made private.

Understanding #pragma omp parallel

I am reading about OpenMP and it sounds amazing. I came to a point where the author states that #pragma omp parallel can be used to create a new team of threads. So I wanted to know what exactly #pragma omp parallel means here. I read that #pragma omp for uses the current team of threads to process a for loop. So I have two examples.
First simple example:
#pragma omp for
for(int n=0; n<10; ++n)
{
printf(" %d", n);
}
printf(".\n");
Second example
#pragma omp parallel
{
#pragma omp for
for(int n=0; n<10; ++n) printf(" %d", n);
}
printf(".\n");
My question is: are those threads created on the fly every time, or once when the application starts? Also, when or why would I want to create a team of more threads?
Your first example will not run in parallel on its own: #pragma omp for tells the compiler to distribute the workload of the following loop among the team of threads, which you have to create first. A team of threads is created with the #pragma omp parallel statement, as in your second example. You can combine the omp parallel and omp for directives by using #pragma omp parallel for.
The team of threads is created at the parallel statement and is valid within that block.
TL;DR: The only difference is that the first code implies two implicit barriers whereas the second implies only one.
A more detailed answer, using the modern official OpenMP 5.1 standard as reference.
The
#pragma omp parallel
directive will create a parallel region with a team of threads, where each thread will execute the entire block of code that the parallel region encloses.
From the OpenMP 5.1 standard one can read a more formal description:
When a thread encounters a parallel construct, a team of threads is
created to execute the parallel region (..). The
thread that encountered the parallel construct becomes the primary
thread of the new team, with a thread number of zero for the duration
of the new parallel region. All threads in the new team, including the
primary thread, execute the region. Once the team is created, the
number of threads in the team remains constant for the duration of
that parallel region.
The
#pragma omp parallel for
directive will create a parallel region (as described before), and the iterations of the loop that it encloses will be assigned to the threads of that region, using the default chunk size and the default schedule (which is typically static). Bear in mind, however, that the default schedule may differ among concrete implementations of the OpenMP standard.
From the OpenMP 5.1 standard you can read a more formal description:
The worksharing-loop construct specifies that the iterations of one or
more associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks. The iterations are
distributed across threads that already exist in the team that is
executing the parallel region to which the worksharing-loop region
binds.
Moreover,
The parallel loop construct is a shortcut for specifying a parallel
construct containing a loop construct with one or more associated
loops and no other statements.
Or informally, #pragma omp parallel for is a combination of the constructor #pragma omp parallel with #pragma omp for.
Both versions that you have, with a chunk_size=1 and a static schedule, would transform the loop into something logically similar to:
for(int i=omp_get_thread_num(); i < n; i+=omp_get_num_threads())
{
//...
}
where omp_get_thread_num()
The omp_get_thread_num routine returns the thread number, within the
current team, of the calling thread.
and omp_get_num_threads()
Returns the number of threads in the current team. In a sequential
section of the program omp_get_num_threads returns 1.
or in other words, for(int i = THREAD_ID; i < n; i += TOTAL_THREADS). With THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads of the team created on the parallel region.
A "parallel" region can contain more than a simple "for" loop.
The first time your program meets "parallel", the OpenMP thread team will be created; after that, every OpenMP construct will reuse those threads for loops, sections, tasks, etc.

OpenMP construct to continue execution as soon as at least 1 thread is finished

I need to continue execution as soon as one of the threads has finished. The logic inside the parallel section will ensure that everything has been completed satisfactorily. I have nested parallelisation, therefore I put some of the top-level threads to sleep when data is not ready to be processed, so as not to consume computation power. So when one of the top-level threads finishes, I want to continue execution and not wait for the other threads to wake up and return naturally.
I use
#pragma omp parallel for num_threads(wanted_thread_no)
How do you parallelise? Do you use tasks, sections, or something else?
If I understood correctly, you can add the nowait clause to the last worksharing construct (nowait applies to constructs such as omp for, omp sections, and omp single, not to the parallel directive itself).
Check page 13 of this PDF:
http://openmp.org/wp/presos/omp-in-action-SC05.pdf
It explicitly says:
By default, there is a barrier at the end of the “omp for”. Use the
“nowait” clause to turn off the barrier.
#pragma omp for nowait
"nowait" is useful between two consecutive, independent omp for loops.
Is this what you want?
Also take a look at this, even though it says much the same thing:
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf