OpenMP and shared variable in Fortran which are not shared - fortran

I encounter a problem with OpenMP and shared variables I cannot understand. Everything I do is in Fortran 90/95.
Here is my problem: I have a parallel region defined in my main program, with the clause DEFAULT(SHARED), in which I call a subroutine that does some computation. I have a local variable (an array) I allocate and on which I do the computations. I was expecting this array to be shared (because of the DEFAULT(SHARED) clause), but it seems that it is not the case.
Here is an example of what I am trying to do and that reproduce the error I get:
program main
!$ use OMP_LIB
implicit none
integer, parameter :: nx=10, ny=10
real(8), dimension(:,:), allocatable :: array
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP SINGLE
allocate(array(nx,ny))
!$OMP END SINGLE
!$OMP WORKSHARE
array = 1.
!$OMP END WORKSHARE
call compute(array,nx,ny)
!$OMP SINGLE
deallocate(array)
!$OMP END SINGLE
!$OMP END PARALLEL
contains
!=============================================================================
! SUBROUTINES
!=============================================================================
subroutine compute(array, nx, ny)
!$ use OMP_LIB
implicit none
real(8), dimension(nx,ny) :: array
integer :: nx, ny
real(8), dimension(:,:), allocatable :: q
integer :: i, j
!$OMP SINGLE
allocate(q(nx,ny))
!$OMP END SINGLE
!$OMP WORKSHARE
q = 0.
!$OMP END WORKSHARE
print*, 'q before: ', q(1,1)
!$OMP DO SCHEDULE(RUNTIME)
do j = 1, ny
do i = 1, nx
if(mod(i,j).eq.0) then
q(i,j) = array(i,j)*2.
else
q(i,j) = array(i,j)*0.5
endif
end do
end do
!$OMP END DO
print*, 'q after: ', q(1,1)
!$OMP SINGLE
deallocate(q)
!$OMP END SINGLE
end subroutine compute
!=============================================================================
end program main
When I execute it like that, I get a segmentation fault, because the local array q is allocated on one thread but not on the others, and when the others try to access it in memory, it crashes.
If I get rid of the SINGLE region the local array q is allocated (though sometimes it crashes, which make sense, if different threads try to allocate it whereas it is already the case (and actually it puzzles me why it does not crash everytime)) but then it is clearly as if the array q is private (therefore one thread returns me the expected value, whereas the others return me something else).
It really puzzled me why the q array is not shared although I declared my parallel region with the clause DEFAULT(SHARED). And since I am in an orphaned subroutine, I cannot declare explicitely q as shared, since it is known only in the subroutine compute... I am stuck with this problem so far, I could not find a workaround.
Is it normal? Should I expect this behaviour? Is there a workaround? Do I miss something obvious?
Any help would be highly appreciated!

q is an entity that is "inside a region but not inside a construct" in terms of OpenMP speak. The subroutine that q is local to is in a procedure that is called during a parallel construct, but q itself does not lexically appear in between the PARALLEL and END PARALLEL directives.
The data sharing rules for such entities in OpenMP then dictate that q is private.
The data sharing clauses such as DEFAULT(SHARED), etc only apply to things that appear in the construct itself (things that lexically appear in between the PARALLEL and END PARALLEL). (They can't apply to things in the region generally - procedures called in the region may have been separately compiled and might be called outside of any parallel constructs.)

The array q is defined INSIDE the called subroutine. Every thread calls this subroutine independently and therefore every thread will have it's own copy. The shared directive in the outer subroutine cannot change this. Try to declare it with the save attribute.

Related

Openmp nested parallelism use available threads

So, I have this simple Fortran do loop and inside that loop a couple of subroutines are called. I have
Made the do loop parallel with OpenMP, like this
!$omp parallel do
do i=1,n
call a()
call b()
enddo
!$omp end parallel do
Now most of the times the number of iterations in the loop is
less compared to the number of processor/threads available and the subroutines that are called inside the
loop can be called in parallel. So, is there a way to call the subroutines in parallel inside the parallel
do loop ? I have tried with task like this
!$omp parallel do
do i=1,n
!$omp task
call a(i , j )
!$omp end task
!$omp task
call b(i, k)
!$omp end task
!$omp taskwait
enddo
!$omp end parallel do
But this shows some error with segmentation fault. Is there any way to achieve this.
UPDATE:
So, I found out the main reason for the segmentation fault is coming from the fftw library. Lets consider a dummy program
program name
!$use omp_lib
implicit real*8(a-h,p-z)
call system_clock(count_rate=irate)
call system_clock(it1)
!$ call omp_set_nested(.true.)
!$omp parallel do
do i =1,5
call test(i)
print *, i
enddo
!$omp end parallel do
call system_clock(it2)
print *, (it2-it1)/real(irate, kind=8)
end program name
subroutine test(ii)
! just a dummy subroutine for heavy computation
implicit real*8(a-h,p-z)
do j=1,40000
!$omp task
do k=1,40000
x = exp(sqrt(sqrt(2.0d0*ii**3)**2))
enddo
!$omp end task
enddo
end subroutine
This program works exactly what I wants and using the task directives, uses the remaining threads and improves the performance. Now lets consider another dummy program but with fftw, similar to what I'm working.
program name
!$use omp_lib
implicit real*8(a-h,p-z)
integer, parameter :: n=8192*8
complex(kind=8) :: arr(n)
real(kind=8) :: tmp1(n), tmp2(n)
integer(kind=8) :: pF
integer :: i
call system_clock(count_rate=irate)
call dfftw_plan_dft_1d(pF,n,arr,arr,-1,0) ! forward
call system_clock(it1)
!$ call omp_set_nested(.true.)
!$omp parallel do private(arr)
do i =1,5
call random_number(tmp1)
call random_number(tmp2)
arr = cmplx(tmp1, tmp2, kind=8)
call test(pF, arr)
print *, i
enddo
!$omp end parallel do
call system_clock(it2)
print *, (it2-it1)/real(irate, kind=8)
end program name
subroutine test(pF, arr)
implicit real*8(a-h,p-z)
complex(kind=8) :: arr(:)
integer(kind=8) :: pF
do j=1,100
!$omp task private(arr)
do k=1, 100
call dfftw_execute_dft(pF, arr, arr)
enddo
!$omp end task
enddo
end subroutine
Now, this throws the segmentation fault. (NOTE: I have no random numer call in my actual program, they are here just for a dummy purpose). I have checked http://www.fftw.org/fftw3_doc/Thread-safety.html and fftw_execute is thread safe and the program works without the task directives. But with the task it throws error. Anyone knows how to fix this ?
Sigh, yet another example of why !$omp do parallel is a bad idea ... I really do think it is best to clearly separate the thread creation and worksharing phases.
As Vladimir says in the comments you haven't provided nearly enough detail to tell why you are getting a segmentation fault. However you seem to have a few misconceptions about OpenMP which I can try to address.
Firstly a very quick and dirty way to achieve what you want and avoiding any extra OpenMP directives is
!$omp parallel default( none ) private( i ) shared( n ) ! Create threads
!$omp do ! Now share out the work
Do i = 1, 2 * n
If( Mod( i, 2 ) == 1 ) Then
Call a
Else
Call b
End Do
!$omp end do
!$omp end parallel
However if you want to use tasks you're probably not doing it the easiest way if all calls to a and b are completely independent. In that case remember that a new task is created whenever ANY thread hits a !$omp task, and that that task can be executed by any thread, not just the one that created it. Following that logic something like
!$omp parallel default( none ) private( i ) shared( n ) ! Crate the threads
!$omp single
Do i = 1, n
!$omp task
Call a
!$omp end task
!$omp task
call b
!$omp end task
end do
!$omp end single
!$omp end parallel
is what you want - you use one thread to create the list of tasks, and then (or more probably while the list is being created) all the available threads will execute them, each task being taken by the next available thread. Note I have also missed out the taskwait directive as from your description I'm not sure why you think you need it as I can see no need for synchronisation at that point.

Fortran & OpenMP: How to declare and allocate an allocatable THREADPRIVATE array

I had a serial code where I would declare a bunch of variables in modules and then use those modules across the rest of my program and subroutines. Now I am trying to parallelize this code. There is a portion of the code that I want to run in parallel which seems to be working except for one array, gtmp. I want each thread to have it's own version of gtmp and I want that version to be private to its respective thread, so I've used the threadprivate directive. gtmp is only used inside the parallel region of the code or within subroutines that are only called from the parallel part of the code.
At first I allocated gtmp in a serial portion of the code before the parallel portion, but that was an issue because then only the master thread 'version' of gtmp got allocated and the other thread 'versions' of gtmp had a size of 1 rather than the expected allocated size of gtmp, (this was shown by the "test" print statement). I think this happened because the master thread is the only thread executing code in the serial portions. So, I moved the allocate line into the parallel region, which allowed all threads to have appropriately sized/allocated gtmp arrays, but since my parallel region is inside a loop I get an error when the program tries to allocate gtmp a second time in the second iteration of the r loop.
Note: elsewhere in the code all the other variables in mymod are given values.
Here is a simplified portion of the code that is having the issue:
module mymod
integer :: xBins, zBins, rBins, histCosThBins, histPhiBins, cfgRBins
real(kind=dp),allocatable :: gtmp(:,:,:)
end module mymod
subroutine compute_avg_force
use mymod
implicit none
integer :: r, i, j, ip
integer :: omp_get_thread_num, tid
! I used to allocate 'gtmp' here.
do r = 1, cfgRBins
!$omp PARALLEL DEFAULT( none ) &
!$omp PRIVATE( ip, i, j, tid ) &
!$omp SHARED( r, xBins, zBins, histCosThBins, histPhiBins )
allocate( gtmp(4,0:histCosThBins+1,0:histPhiBins+1) )
tid = omp_get_thread_num() !debug
print*, 'test', tid, histCosThBins, histPhiBins, size(gtmp)
!$omp DO SCHEDULE( guided )
do ip = 1, (xBins*zBins)
call subroutine_where_i_alter_gtmp(...)
...code to be executed in parallel using gtmp...
end do !ip
!$omp END DO
!$omp END PARALLEL
end do !r
end subroutine compute_avg_force
So, the issue is coming from the fact that I need all threads to be active, (ie. in a parallel region), to appropriately initialize all 'versions' of gtmp but my parallel region is inside a loop and I can't allocate gtmp more than once.
In short, what is the correct way to allocate gtmp in this code? I've thought that I could just make another omp parallel region before the loop and use that to allocate gtmp but that seems clunky so I'm wondering what the "right" way to do something like this is.
Thanks for the help!

Global Variables in Fortran OpenMP

Why the following code in Fortran only works if I put the loop variables 'i' and 'j' as input arguments of the subroutine 'mat_init'? The loop variables 'i' and 'j' are declared as private, so shouldn't they remain private inside the subroutine when I call it?
program main
use omp_lib
implicit none
real(8), dimension(:,:), allocatable:: A
integer:: i, j, n
n = 20
allocate(A(n,n)); A(:,:) = 0.0d+00
!$omp parallel do private(i, j)
do i=1,n
do j=1,n
call mat_init
end do
end do
do i=1,n
write(*,'(20f7.4)') (A(i,j), j=1,n)
end do
contains
subroutine mat_init
A(i,j) = 1.0d+00
end subroutine
end program main
I know this have something to do with the 'lexical' and 'dynamic' extend, but I don't understand why OpenMP is implemented in this way to don't recognize private variables in the 'dymanic' extend inside de parallel regions. For me it seems not to be logical or am I doing anything wrong?
First, I think that the subroutine mat_init should takes the value of i and j as input arguments explicitly. Then, the value of i and j must be private, because each thread works on a specific value of i and j. I think also that openmp recognizes that i is private because the parallelized loop is on i. Idem for j. However, this work for the global variables i and j and not for those ones who are internal to the subroutine. Thus, you have to specify that i and j are private in order to force the subroutine internal variables to inhiritate of this aspect.
I believe that the problem is due to the reentrance of the subroutine mat_init. Indeed, what happen when multiple threads enter the subroutine at the same time with different value of i and j? If you don't do any special thing, the called subroutine might not recognize the private aspect of i and j.
In general, it is not welcomed to call many times a subroutine inside a loop, because each call requires a given time. I suggest to write a subroutine that is parallelized rather than call a subroutine within a parallelized section.

How to efficiently parallelize a linked list using OpenMP (using tasks?)

This is a long post -- a lot of background before the question. The quick version is that I've tried to use OpenMP on elements of a linked list -- using OpenMP tasks in a way I've seen prescribed elsewhere, but that leads to a significant slowdown. However, I can get a significant speedup if I divide things up differently, but I'm wondering if there's a way to get the first way to work, since it's cleaner/simpler and (I think) it dynamically balances work across the threads.
I've got a reasonably long linked list (can be a couple million elements) of Fortran types (C structures) and -- several times -- I've got to iterate over the list and operate on each of the elements. So, I've got a subroutine (eachPhonon) that takes a subroutine as an argument (srt) and operates that on each element of the list:
subroutine eachPhonon(srt)
external :: srt
type(phonon), pointer :: tptr
tptr => head
do while(associated(tptr))
call srt(tptr)
tptr => tptr%next
enddo
endsubroutine
It seems like this is a good place for a parallel speedup, since each call of srt can be done independently of the others. This would be very simple using openmp if I had a Fortran do (C for) loop. However, I've seen a method for how to do it using a linked list, both on stackoverflow and from intel. Basically, it makes each call to srt it's own task -- something like:
subroutine eachPhonon(srt)
external :: srt
type(phonon), pointer :: tptr
tptr => head
!$OMP PARALLEL
!$OMP SINGLE
do while(associated(tptr))
!$OMP TASK FIRSTPRIVATE(tptr)
call srt(tptr)
!$OMP END TASK
tptr => tptr%next
enddo
!$OMP END SINGLE
!$OMP END PARALLEL
endsubroutine
This seems to work, but it's significantly slower than using just one thread.
I rewrote things so that given, say, 4 threads, one thread would operate on elements 1,5,9..., another on elements 2,6,10..., etc.:
subroutine everyNth(srt, tp, n)
external :: srt
type(phonon), pointer :: tp
integer :: n, j
do while(associated(tp))
call srt(tp)
do j=1,n
if(associated(tp)) tp => tp%next
enddo
enddo
endsubroutine
subroutine eachPhononParallel(srt)
use omp_lib
external :: srt
type(phonon), pointer :: tp
integer :: j, nthreads
!$OMP PARALLEL
!$OMP SINGLE
nthreads = OMP_GET_NUM_THREADS()
tp => head
do j=1,nthreads
!$OMP TASK FIRSTPRIVATE(tp)
call everyNth(srt, tp, nthreads)
!$OMP END TASK
tp => tp%next
enddo
!$OMP END SINGLE
!$OMP END PARALLEL
endsubroutine
This can lead to a significant speedup.
Is there a way to make the first method efficient?
I'm new to parallel processing, but my reading is that the first method has too much overhead since it tries to make a task for each element. The second way only makes one task for each thread and avoids that overhead. The downside is somewhat less clean code that can't be compiled without openmp, and it won't dynamically balance work across the threads -- it's all statically assigned at the beginning.
If the granularity of your parallelism is too fine, you may try to operate on chunks of a bigger size:
subroutine eachPhonon(srt,chunksize)
external :: srt
integer, intent(in) :: chunksize
type(phonon), pointer :: tptr
tptr => head
!$OMP PARALLEL
!$OMP SINGLE
do while(associated(tptr))
!$OMP TASK FIRSTPRIVATE(tptr)
! Applies srt(tptr) chunksize times or until
! associated(tptr)
call chunk_srt(tptr,chunksize)
!$OMP END TASK
! Advance tptr chunksize times if associated(tptr)
advance(tprt,chunksize)
enddo
!$OMP END SINGLE
!$OMP END PARALLEL
endsubroutine
The idea is to set chunksize to a value big enough to mask the overhead that is associated with task creation.
The slowdown means that srt() takes too little time to execute and therefore the overhead swamps the possible parallel speed-up. Besides Massimiliano's solution, you can also convert the linked list into an array of pointers and then use PARALLEL DO on the resultant structure:
type phononptr
type(phonon), pointer :: p
endtype phononptr
...
subroutine eachPhonon(srt)
external :: srt
type(phonon), pointer :: tptr
type(phononptr), dimension(:), allocatable :: ptrs
integer :: i
allocate(ptrs(numphonons))
tptr => head
i = 1
do while(associated(tptr))
ptrs(i)%p => tptr
i = i + 1
tptr => tptr%next
enddo
!$OMP PARALLEL DO SCHEDULE(STATIC)
do i = 1, numphonons
call srt(ptrs(i)%p)
enddo
!$OMP END PARALLEL DO
endsubroutine
If you do not explicitly keep the number of list items in a separate variable (numphonons in this case), you would have to traverse the list twice. The phononptr type is neccessary because Fortran lacks an easier way to declare arrays of pointers.
The same can also be achieved by setting chunksize in Massimiliano's solution to numphonons / omp_get_num_threads().

Openmp Fortran Subroutine

I am trying to parallelize a subroutine using Openmp.
The subroutine contains a successive over relaxation loop which runs on the total
error which is a shared variable. Now, when I parralelize the part where I call the
subroutine in the main program, it makes the error a private variable and then I can't make it explicitly a shared variable in the main program.
I am pasting the code for reference.
program test
!$omp parallel
call sub()
!$omp end parallel
end program test
subroutine sub()
do while(totalerror.ge.0.0001.and.sor.lt.10000)
totalerror=0.0
sor=sor+1
error=0.0
!$OMP DO REDUCTION(+:toterror) REDUCTION(MAX:error)
! shared (vorticity,strmfn,toterror,error,guess) PRIVATE (i,j,t1,t2)
do i=1,nx
do j=1,ny
guess(i,j)=0.25*((h**2.)*vorticity(i,j)+strmfn(i+1,j)+strmfn(i- 1,j)+strmfn(i,j+1)+strmfn(i,j-1))
totalerror = totalerror + error
error = max(abs(strmfn(`enter code here`i,j) - guess(i,j)),error)
strmfn(i,j)= strmfn(i,j) + omega*(guess(i,j)-strmfn(i,j))
enddo
enddo
!$OMP END DO
enddo
Any help would be appreciated.
toterror and error shouldn't be in the shared clause since they are in the reduction. If you need shared versions, copy them to different variables.