Fortran arrays in hybrid MPI/OpenMP

I am facing the following issue when running a hybrid MPI/OpenMP
code built with the GNU and Intel compilers and OpenMPI. The code is a large
commercial application written in Fortran. It compiles and runs fine with the
GNU compilers but crashes with the Intel compilers.
I have tracked down the part of the code where the program stops working;
it has the following structure:
subroutine test(n,dx,dy)
integer :: n, i
real*8 :: dx(n),dy(n), ener
real*8, external :: funct
ener=0.0
!$omp parallel num_threads(2)
!$omp do
do i=1,100
ener = ener + funct(n,dx,dy) + i
enddo
!$omp end do
!$omp end parallel
end subroutine test
and the function funct has this structure:
real*8 function funct(n,dx,dy)
integer :: n, i
real*8 :: dx(*),dy(*)
funct = 0.0
do i=1,n
funct = funct + dx(i)+dy(i)
enddo
end function funct
Specifically, the code stops inside funct (with Intel). The
program is able to reach the end of funct, but only one of the
two requested threads returns the value; I checked
that by printing the thread numbers.
The issue occurs only with the Intel compilers; with GNU I don't
see it.
One way I found to avoid the issue is to use explicit-shape arrays
inside funct instead of assumed-size ones, as follows:
real*8 function funct(n,dx,dy)
integer :: n
real*8 :: dx(n),dy(n)
but my point is that I don't understand what is happening.
My guess is that in the Intel case the compiler cannot
figure out the length of dx and dy inside funct, but I am
not sure. I tried to reproduce the issue with a small
Fortran program, but I was not able to.
Any comment is welcome.
One update: I eliminated the race condition (this is not the real problem;
what I wrote above is just the structure of the code).
I realized that subroutine test is being called from another subroutine
upper, which declares dx and dy as pointers:
subroutine upper
real*8,save,pointer :: dx(:)=>Null(), dy(:)=>Null()
....
call test(n,dx,dy)
...
end subroutine upper
What I did now was to replace the pointers with allocatables:
subroutine upper
real*8,save,dimension(:),allocatable :: dx,dy
....
allocate(dx(n),dy(n))
call test(n,dx,dy)
...
end subroutine upper
and I don't get the issue with Intel. I don't know what the
difference between pointers and allocatables could be.
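The only concrete difference I can think of is contiguity: an allocatable
array is always contiguous, whereas a pointer may be associated with a
non-contiguous target, and passing such a pointer to an assumed-size dummy
like dx(*) forces the compiler to create a temporary copy (copy-in/copy-out).
A tiny sketch of what I mean (hypothetical names, not the real code):
real*8, target :: big(12)
real*8, pointer :: p(:)
p => big(1:12:2)  ! p is non-contiguous; passing p to a dx(*) dummy needs a temporary copy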
Thanks.

Related

Is there a clear reason this loop will not vectorize?

I am assessing the performance of a Fortran 90 code. Running the code through Intel's Advisor I see that loops with the following style are not getting vectorized. An example of the loop structure is shown in the subroutine and module below.
The code is being compiled with Intel's compiler 19.0.3, with -O3 optimization turned on.
Subroutine SampleProblem
Use GlobalVariables
Implicit None
Integer :: ND, K, LP, L
Real :: AVTMP
! Sample of loop structure that is not vectorized
DO ND=1,NDM
DO K=1,KS
DO LP=1,LLWET(K,ND)
L = LKWET(LP,K,ND)
AVTMP = AVMX*HPI(L)
ENDDO
ENDDO
ENDDO
End Subroutine SampleProblem
LLWET and LKWET are allocatable arrays declared in a module 'GlobalVariables'. Something like:
Module GlobalVariables
Implicit None
! Variable declarations
REAL :: AVMX
INTEGER :: NDM
REAL,ALLOCATABLE,DIMENSION(:) :: HPI
INTEGER,ALLOCATABLE,DIMENSION(:,:) :: LLWET
INTEGER,ALLOCATABLE,DIMENSION(:,:,:) :: LKWET
End Module GlobalVariables
I don't see why this loop would not get vectorized by the compiler. There are many loops like this all over the code, and none of them get vectorized, per the reported results of Intel's Advisor. I have tried forcing vectorization with a !$SIMD block around the loop.
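For reference, the standard OpenMP spelling of the directive is !$OMP SIMD (it needs -qopenmp or -qopenmp-simd to be honored); a minimal sketch of applying it to the innermost loop, assuming the loop body stays as shown above:
DO ND=1,NDM
DO K=1,KS
!$OMP SIMD PRIVATE(L, AVTMP)
DO LP=1,LLWET(K,ND)
L = LKWET(LP,K,ND)    ! indirect index: reads of HPI(L) become gathers
AVTMP = AVMX*HPI(L)
ENDDO
ENDDO
ENDDO
Note also that AVTMP is never used after the loop, so in a reduced sample like this an optimizing compiler may simply eliminate the loop body, and Advisor's report on it may not reflect the real code.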

What is the difference between (ulimit -s unlimited) and (export KMP_STACKSIZE=xx)?

I ran my program as below, and used ulimit -s unlimited.
It works.
REAL(DP), DIMENSION(1024,2,1541) :: L_X, TanV
REAL(DP), DIMENSION(2) :: PSL_X
REAL(DP), DIMENSION(4) :: Val_X, Val_Y
REAL(DP), DIMENSION(1029) :: E_x
REAL(DP), DIMENSION(1024) :: E_y
REAL(DP), DIMENSION(1024,1024) :: EE_Fx, EE_Fy
!$OMP SECTIONS PRIVATE(i, j, ii,jj, PSL_X, i_x, i_y, Val_X, Val_Y)
!$OMP SECTION
do j=1,LinkPlusBndry
do i=1,Kmax(j)-1
PSL_X(1)=modulo(L_X(i,1,j),H*N2); PSL_X(2)=L_X(i,2,j)
i_x=floor(PSL_X(1)/H)+2; i_y=floor(PSL_X(2)/H)
call Delta4((E_x(i_x:i_x+3)-PSL_X(1))/H,Val_X)
call Delta4((E_y(i_y:i_y+3)-PSL_X(2))/H,Val_Y)
do ii=1,4; do jj=1,4
EE_Fx(i_y+ii-1,i_x+jj-1)=EE_Fx(i_y+ii-1,i_x+jj-1) &
+tauH2*TanV(i,1,j)*Val_X(jj)*Val_Y(ii)
end do; end do
end do
end do
...
...
...
!$OMP SECTION
do j=1,LinkPlusBndry
do i=1,Kmax(j)-1
PSL_X(1)=modulo(L_X(i,1,j),H*N2); PSL_X(2)=L_X(i,2,j)
i_x=floor(PSL_X(1)/H)+2; i_y=floor(PSL_X(2)/H)
call Delta4((E_x(i_x:i_x+3)-PSL_X(1))/H,Val_X)
call Delta4((E_y(i_y:i_y+3)-PSL_X(2))/H,Val_Y)
do ii=1,4; do jj=1,4
EE_Fy(i_y+ii-1,i_x+jj-1)=EE_Fy(i_y+ii-1,i_x+jj-1) &
+tauH2*TanV(i,2,j)*Val_X(jj)*Val_Y(ii)
end do; end do
end do
end do
!$OMP END SECTIONS
I don't like using !$OMP SECTIONS; it limits the speedup because only 2 threads are used.
So I changed my code as below.
!$OMP DO PRIVATE(j, i, PSL_X, i_x, i_y, ii, jj, Val_X, Val_Y) REDUCTION(+:EE_Fx, EE_Fy)
do j=1,LinkPlusBndry
do i=1,Kmax(j)-1
PSL_X(1)=modulo(L_X(i,1,j),H*N2); PSL_X(2)=L_X(i,2,j)
i_x=floor(PSL_X(1)/H)+2; i_y=floor(PSL_X(2)/H)
call Delta4((E_x(i_x:i_x+3)-PSL_X(1))/H,Val_X)
call Delta4((E_y(i_y:i_y+3)-PSL_X(2))/H,Val_Y)
do ii=1,4; do jj=1,4
EE_Fx(i_y+ii-1,i_x+jj-1)=EE_Fx(i_y+ii-1,i_x+jj-1) &
+tauH2*TanV(i,1,j)*Val_X(jj)*Val_Y(ii)
EE_Fy(i_y+ii-1,i_x+jj-1)=EE_Fy(i_y+ii-1,i_x+jj-1) &
+tauH2*TanV(i,2,j)*Val_X(jj)*Val_Y(ii)
end do; end do
PSL_X(1)=modulo(L_X(i+1,1,j),H*N2); PSL_X(2)=L_X(i+1,2,j)
i_x=floor(PSL_X(1)/H)+2; i_y=floor(PSL_X(2)/H)
call Delta4((E_x(i_x:i_x+3)-PSL_X(1))/H,Val_X)
call Delta4((E_y(i_y:i_y+3)-PSL_X(2))/H,Val_Y)
do ii=1,4; do jj=1,4
EE_Fx(i_y+ii-1,i_x+jj-1)=EE_Fx(i_y+ii-1,i_x+jj-1) &
-tauH2*TanV(i,1,j)*Val_X(jj)*Val_Y(ii)
EE_Fy(i_y+ii-1,i_x+jj-1)=EE_Fy(i_y+ii-1,i_x+jj-1) &
-tauH2*TanV(i,2,j)*Val_X(jj)*Val_Y(ii)
end do; end do
end do
end do
!$OMP END DO
When I launch this code, I get a segmentation fault.
I thought it was related to the memory size.
So, after searching, I found this solution:
export KMP_STACKSIZE=value
Now I use 2 different commands
ulimit -s unlimited
and
export KMP_STACKSIZE=value
It works well, but I don't know the difference between the two commands.
What is the difference?
ulimit -s sets the OS limit for the stack of the program's initial (main) thread.
KMP_STACKSIZE tells the OpenMP implementation how much stack to actually allocate for each of the worker threads' stacks. So, depending on your OS defaults, you might need both. BTW, you should rather use OMP_STACKSIZE instead, as KMP_STACKSIZE is the environment variable used by the Intel and clang compilers; OMP_STACKSIZE is the standard way of setting the stack size of the OpenMP threads.
Note that this problem is usually more exposed in Fortran, as Fortran tends to keep more data on the stack, especially arrays. Some compilers can move such arrays to the heap automatically; see for instance -heap-arrays for the Intel compiler.
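For example, a typical invocation would set both before launching the program (the stack size below is just a placeholder):
ulimit -s unlimited         # OS stack limit for the initial thread
export OMP_STACKSIZE=512M   # per-thread stack for the OpenMP worker threads
./a.out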

Program stops due to array allocation in a function [duplicate]

The following code returns a segmentation fault because the allocatable array I am trying to pass is not being properly recognized (SIZE returns 1 when it should be 3). On this page (http://www.eng-tips.com/viewthread.cfm?qid=170599) a similar example seems to indicate that it should work fine in F95; my code file has a .F90 extension, but I tried changing it to .F95, and I am using gfortran to compile.
My guess is that the problem is in the way I am passing the allocatable array to the subroutine. What am I doing wrong?
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
PROGRAM test
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
IMPLICIT NONE
DOUBLE PRECISION,ALLOCATABLE :: Array(:,:)
INTEGER :: iii,jjj
ALLOCATE(Array(3,3))
DO iii=1,3
DO jjj=1,3
Array(iii,jjj)=iii+jjj
PRINT*,Array(iii,jjj)
ENDDO
ENDDO
CALL Subtest(Array)
END PROGRAM
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
SUBROUTINE Subtest(Array)
DOUBLE PRECISION,ALLOCATABLE,INTENT(IN) :: Array(:,:)
INTEGER :: iii,jjj
PRINT*,SIZE(Array,1),SIZE(Array,2)
DO iii=1,SIZE(Array,1)
DO jjj=1,SIZE(Array,2)
PRINT*,Array(iii,jjj)
ENDDO
ENDDO
END SUBROUTINE
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
If a procedure has a dummy argument that is an allocatable, then an explicit interface is required in any calling scope.
(There are numerous things that require an explicit interface, an allocatable dummy is but one.)
You can provide that explicit interface yourself by putting an interface block for your subroutine inside the main program. An alternative and far, far, far better option is to put the subroutine inside a module and then USE that module in the main program - the explicit interface is then automatically created. There is an example of this on the eng-tips site that you provided a link to - see the post by xwb.
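For completeness, the do-it-yourself interface block would look like this in the declaration part of the main program (reproducing the declaration of the dummy argument exactly):
INTERFACE
SUBROUTINE Subtest(Array)
DOUBLE PRECISION,ALLOCATABLE,INTENT(IN) :: Array(:,:)
END SUBROUTINE Subtest
END INTERFACE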
Note that it only makes sense for a dummy argument to have the allocatable attribute if you are going to do something related to its allocation status - query its status, reallocate it, deallocate it, etc.
Please also note that your allocatable dummy argument array is declared with intent(in), which means its allocation status will be that of the associated actual argument (and it may not be changed during the procedure). The actual argument passed to your subroutine may be unallocated and therefore illegal to reference, even with an explicit interface. The compiler will not know this and the behaviour of inquiries like size is undefined in such cases.
Hence, you first have to check the allocation status of array with allocated(array) before referencing its contents. I would further suggest implementing loops over the full array with lbound and ubound, since in general you can't be sure about array's bounds:
subroutine subtest(array)
double precision, allocatable, intent(in) :: array(:,:)
integer :: iii, jjj
if(allocated(array)) then
print*, size(array, 1), size(array, 2)
do iii = lbound(array, 1), ubound(array, 1)
do jjj = lbound(array, 2), ubound(array, 2)
print*, array(iii,jjj)
enddo
enddo
endif
end subroutine
This is a simple example that shares an allocatable array through a module instead of passing it as a dummy argument.
module arrayMod
real,dimension(:,:),allocatable :: theArray
end module arrayMod
program test
use arrayMod
implicit none
interface
subroutine arraySub
end subroutine arraySub
end interface
write(*,*) allocated(theArray)
call arraySub
write(*,*) allocated(theArray)
end program test
subroutine arraySub
use arrayMod
write(*,*) 'Inside arraySub()'
allocate(theArray(3,2))
end subroutine arraySub
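When run, the first write prints F (theArray is not yet allocated) and the second prints T: the allocation performed inside arraySub is visible to the main program through the module.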

OpenMP parallel workshare for an allocatable array

I want to do some element-wise calculations on arrays in Fortran 90 while parallelizing my code with OpenMP. I currently have the following code:
program test
implicit none
integer,parameter :: n=50
integer :: i
integer(8) :: t1,t2,freq
real(8) :: seq(n),r(n,n,n,n)
real(8),dimension(n,n,n,n) :: x
call system_clock(COUNT_RATE=freq)
seq=[(i,i=1,n)]
x=spread(spread(spread(seq,2,n),3,n),4,n)
call system_clock(t1)
!$omp parallel workshare
! do some array calculation
r=atan(exp(-x))
!$omp end parallel workshare
call system_clock(t2)
print*, sum(r)
print '(f6.3)',(t2-t1)/real(freq)
end program test
I now want to replace the static arrays x and r with allocatable arrays, so I write:
real(8),dimension(:,:,:,:),allocatable :: x,r
allocate(x(n,n,n,n))
allocate(r(n,n,n,n))
but then the program runs serially, without errors, and the compiler takes no account of the line "!$omp parallel workshare".
What options should I use to parallelize in this case? I have tried omp parallel do with explicit loops, but it is much slower.
I am compiling my code with gfortran 5.1.0 on Windows:
gfortran -ffree-form test.f -o main.exe -O3 -fopenmp -fno-automatic
I have come across this issue in gfortran before. The solution is to specify the array in the following form:
!$omp parallel workshare
! do some array calculation
r(:,:,:,:) = atan(exp(-x))
!$omp end parallel workshare
Here is the reference.
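Putting it together, a minimal sketch of the allocatable version with the explicit array section (the timing code from the original program is omitted):
program test
implicit none
integer,parameter :: n=50
integer :: i
real(8) :: seq(n)
real(8),dimension(:,:,:,:),allocatable :: x,r
allocate(x(n,n,n,n),r(n,n,n,n))
seq=[(i,i=1,n)]
x=spread(spread(spread(seq,2,n),3,n),4,n)
!$omp parallel workshare
r(:,:,:,:)=atan(exp(-x))   ! explicit section so gfortran workshares the assignment
!$omp end parallel workshare
print*, sum(r)
end program test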

OpenMP and shared variable in Fortran which are not shared

I have encountered a problem with OpenMP and shared variables that I cannot understand. Everything I do is in Fortran 90/95.
Here is my problem: I have a parallel region defined in my main program with the DEFAULT(SHARED) clause, in which I call a subroutine that does some computation. Inside that subroutine I have a local variable (an array) that I allocate and on which I do the computations. I was expecting this array to be shared (because of the DEFAULT(SHARED) clause), but it seems that is not the case.
Here is an example of what I am trying to do, which reproduces the error I get:
program main
!$ use OMP_LIB
implicit none
integer, parameter :: nx=10, ny=10
real(8), dimension(:,:), allocatable :: array
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP SINGLE
allocate(array(nx,ny))
!$OMP END SINGLE
!$OMP WORKSHARE
array = 1.
!$OMP END WORKSHARE
call compute(array,nx,ny)
!$OMP SINGLE
deallocate(array)
!$OMP END SINGLE
!$OMP END PARALLEL
contains
!=============================================================================
! SUBROUTINES
!=============================================================================
subroutine compute(array, nx, ny)
!$ use OMP_LIB
implicit none
real(8), dimension(nx,ny) :: array
integer :: nx, ny
real(8), dimension(:,:), allocatable :: q
integer :: i, j
!$OMP SINGLE
allocate(q(nx,ny))
!$OMP END SINGLE
!$OMP WORKSHARE
q = 0.
!$OMP END WORKSHARE
print*, 'q before: ', q(1,1)
!$OMP DO SCHEDULE(RUNTIME)
do j = 1, ny
do i = 1, nx
if(mod(i,j).eq.0) then
q(i,j) = array(i,j)*2.
else
q(i,j) = array(i,j)*0.5
endif
end do
end do
!$OMP END DO
print*, 'q after: ', q(1,1)
!$OMP SINGLE
deallocate(q)
!$OMP END SINGLE
end subroutine compute
!=============================================================================
end program main
When I execute it like that, I get a segmentation fault, because the local array q is allocated on one thread but not on the others, and the program crashes when the other threads try to access it.
If I get rid of the SINGLE construct, the local array q is allocated (though it sometimes crashes, which makes sense if several threads try to allocate an array that is already allocated; actually it puzzles me why it does not crash every time), but then the array q clearly behaves as if it were private (one thread returns the expected value, whereas the others return something else).
It really puzzles me why the q array is not shared although I declared my parallel region with the DEFAULT(SHARED) clause. And since I am in an orphaned subroutine, I cannot explicitly declare q as shared, since it is known only inside the subroutine compute... I am stuck with this problem so far and could not find a workaround.
Is this normal? Should I expect this behaviour? Is there a workaround? Am I missing something obvious?
Any help would be highly appreciated!
q is an entity that is "inside a region but not inside a construct", in OpenMP speak: the subroutine that q is local to is called from within a parallel construct, but q itself does not lexically appear between the PARALLEL and END PARALLEL directives.
The data sharing rules for such entities in OpenMP then dictate that q is private.
The data sharing clauses such as DEFAULT(SHARED), etc only apply to things that appear in the construct itself (things that lexically appear in between the PARALLEL and END PARALLEL). (They can't apply to things in the region generally - procedures called in the region may have been separately compiled and might be called outside of any parallel constructs.)
The array q is defined INSIDE the called subroutine. Every thread calls this subroutine independently, and therefore every thread will have its own copy. The DEFAULT(SHARED) clause in the caller cannot change this. Try declaring q with the SAVE attribute.
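Alternatively, a minimal sketch of a restructuring that avoids the problem altogether: allocate q in the calling scope, so that the array named in the parallel construct gets the DEFAULT(SHARED) treatment, and pass it into compute as a dummy argument (a reworked fragment, not the original code):
real(8), dimension(:,:), allocatable :: array, q
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP SINGLE
allocate(array(nx,ny), q(nx,ny))   ! array and q appear in the construct, so DEFAULT(SHARED) applies
!$OMP END SINGLE
call compute(array, q, nx, ny)     ! the dummy q in compute is now associated with shared storage
!$OMP SINGLE
deallocate(array, q)
!$OMP END SINGLE
!$OMP END PARALLEL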