forrtl: severe (151): allocatable array is already allocated - Fortran

/var/spool/torque/mom_priv/jobs/775.head.cluster.SC: line 22: 28084 Segmentation fault ./a.out
I am new to Fortran and this is the first time I have worked with HPC and OpenMP.
My code has a loop that should run in parallel. It uses some dynamically allocated variables, all of which are scratch (dummy) arrays within the parallel loop.
I allocate the dynamic variables inside the parallel loop:
!$OMP PARALLEL DO
do 250 iconf = 1,config
allocate(randx(num),randy(num),randz(num),unit_cg(num), &
& x(nfftdim1),y(nfftdim2),z(nfftdim3),fr1(num), &
& fr2(num),fr3(num),theta1(order,num), &
& theta2(order,num),theta3(order,num), &
& Q(nfftdim1,nfftdim2,nfftdim3))
... call some subroutines and do calculations ...
deallocate(randx,randy,randz,unit_cg,fr1,fr2,fr3,theta1,theta2, &
& theta3,x,y,z,Q)
250 continue
!$OMP END PARALLEL DO
I have omitted some irrelevant parts of the code. When the program is executed, this error occurs:
forrtl: severe (151): allocatable array is already allocated
When I allocated the variables outside the parallel region instead, it worked for small data, but for large data this error occurred:
/var/spool/torque/mom_priv/jobs/775.head.cluster.SC: line 22: 28084 Segmentation fault ./a.out
Next I used a PRIVATE clause for the dynamic (scratch) variables:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(randx,randy,randz, &
!$OMP& unit_cg,fr1,fr2,fr3,theta1,theta2,theta3,x,y,z,Q, &
!$OMP& dir_ene,rec_ene,corr_ene,energy_final,plproduct_avg, &
!$OMP& correlation_term)
and allocated the variables inside the parallel loop, but I got the same error. Finally, I changed the code to:
allocate(randx(num),randy(num),randz(num),unit_cg(num), &
& x(nfftdim1),y(nfftdim2),z(nfftdim3),fr1(num), &
& fr2(num),fr3(num),theta1(order,num), &
& theta2(order,num),theta3(order,num), &
& Q(nfftdim1,nfftdim2,nfftdim3))
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(randx,randy,randz, &
!$OMP& unit_cg,fr1,fr2,fr3,theta1,theta2,theta3,x,y,z,Q, &
!$OMP& dir_ene,rec_ene,corr_ene,energy_final,plproduct_avg, &
!$OMP& correlation_term)
do 250 iconf = 1,config
... call some subroutines and do calculations ...
250 continue
!$OMP END PARALLEL DO
deallocate(randx,randy,randz,unit_cg,fr1,fr2,fr3,theta1,theta2, &
& theta3,x,y,z,Q)
It fails at run time: it starts N loops (N being the number of threads) but cannot complete them, and gives this error again:
/var/spool/torque/mom_priv/jobs/775.head.cluster.SC: line 22: 28084 Segmentation fault ./a.out
Any ideas?

I changed the code and finally it WORKS!
The directive !$OMP PARALLEL DO is a shortcut for the two directives !$OMP PARALLEL and !$OMP DO. I used these two directives instead, and put the allocation inside the parallel region but before the worksharing loop. I guess (but I'm not sure) that the compiler now knows how to allocate memory for the private variables, because the PRIVATE clause comes before the allocation, so the segmentation fault does not occur.
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iconf,d,randx, &
!$OMP& randy,randz,unit_cg,theta1,theta2,theta3,fr1,fr2,fr3,Q, &
!$OMP& plproduct_avg)
allocate(randx(num),randy(num),randz(num),unit_cg(num), &
& fr1(num),fr2(num),fr3(num),theta1(order,num), &
& theta2(order,num),theta3(order,num), &
& Q(nfftdim1,nfftdim2,nfftdim3))
!$OMP DO
do 250 iconf = 1,config
... call some subroutines and do calculations ...
250 continue
!$OMP END DO
deallocate(randx,randy,randz,unit_cg,fr1,fr2,fr3,theta1,theta2, &
& theta3,Q)
!$OMP END PARALLEL

Related

OpenMP parallelization of two do-loops with condition on indexes

I've tried to parallelize a code that contains such a double do-loop. It is certainly not efficient, but that is not a big problem right now.
The output tauv is NaN. That is the first problem.
The second problem is that the Intel compiler gives a fatal error when the number of threads is less than the maximum number of threads (8 on my machine).
How can I treat these problems?
!$omp parallel do private(i,j, ro11,ro21,ro12,ro22, &
!$omp& u11,u21,u12,u22, &
!$omp& v11,v21,v12,v22, &
!$omp& es11,es21,es12,es22, &
!$omp& p11,p21,p12,p22, &
!$omp& te11,te21,te12,te22, &
!$omp& emu11,emu21,emu12,emu22) &
!$omp& shared(i1l, i2l, j1l, j2l, emumax, tauv, tauvij, ro, u, v, es)
do i=i1l+2,i2l-2,2
do j=j1l+2,j2l-2,2
if (i.le.niii.and.i.ge.0.and.j.ge.0.and.j.le.nj.or.&
i.le.ni.and.i.ge.niik.and.j.gt.njjv.and.j.le.nj.or.&
i.le.ni.and.i.ge.niik.and.j.ge.0.and.j.lt.njjn&
.or.i.gt.niii.and.i.lt.niik.and.j.gt.njj0+i-niii&
.or.i.gt.niii.and.i.lt.niik.and.j.lt.njj0-i+niii) then
ro11=ro(i-1,j-1)
ro21=ro(i+1,j-1)
ro12=ro(i-1,j+1)
ro22=ro(i+1,j+1)
u11=u(i-1,j-1)
u21=u(i+1,j-1)
u12=u(i-1,j+1)
u22=u(i+1,j+1)
v11=v(i-1,j-1)
v21=v(i+1,j-1)
v12=v(i-1,j+1)
v22=v(i+1,j+1)
es11=es(i-1,j-1)
es21=es(i+1,j-1)
es12=es(i-1,j+1)
es22=es(i+1,j+1)
p11=(es11-0.5*ro11*(u11*u11+v11*v11))*ga1
p21=(es21-0.5*ro21*(u21*u21+v21*v21))*ga1
p12=(es12-0.5*ro12*(u12*u12+v12*v12))*ga1
p22=(es22-0.5*ro22*(u22*u22+v22*v22))*ga1
te11=p11/ro11
te21=p21/ro21
te12=p12/ro12
te22=p22/ro22
emu11=te11**1.5*(1.0+s1)/(te11+s1)
emu21=te21**1.5*(1.0+s1)/(te21+s1)
emu12=te12**1.5*(1.0+s1)/(te12+s1)
emu22=te22**1.5*(1.0+s1)/(te22+s1)
emumax=emu11
if (emu21.gt.emumax) then
emumax=emu21
end if
if (emu12.gt.emumax) then
emumax=emu12
end if
if (emu22.gt.emumax) then
emumax=emu22
end if
tauvij=re*flkv*hx*hx/emumax
if (tauvij .le. tauv) then
tauv=tauvij
endif
endif
enddo
enddo
!$omp end parallel do
The thing is that it executes without error, but the OpenMP do-loop computes more slowly than the sequential one...
From your reproducible example:
1.) Your code uses only 1 thread (?) in the OpenMP region:
! Set number of threads
nthreads = 1
call omp_set_num_threads(nthreads)
print *, 'The number of threads are used is ', omp_get_max_threads ( )
I would avoid calling omp_set_num_threads(). Instead, specify the number of threads with the environment variable OMP_NUM_THREADS. On a Unix machine: export OMP_NUM_THREADS=<number of threads>
2.) In your "reproducible" example, the parallelized loop (line 312) is missing the private/shared declarations. From what you wrote above, fix it to:
!$omp parallel do default(private) shared(i1l, i2l, j1l, j2l, emumax, tauv, tauvij, ro, u, v, es)
With all of the above, the result I get on my machine (4c/4t) using the GNU Fortran compiler is:
...
Executed time in SEQ code is 60.2720146
...
Executed time in OMP code is 27.1342430
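A side note on point 2: emumax and tauvij are written in every iteration and tauv is a running minimum, so leaving all three shared is a data race. Fortran OpenMP has long supported MIN reductions, so a race-free variant of the directive could look like the following sketch (the loop body is elided; tauv is assumed to be initialized before the loop, e.g. to huge(tauv)):
!$omp parallel do default(shared) reduction(min:tauv) &
!$omp& private(j, ro11,ro21,ro12,ro22, u11,u21,u12,u22, &
!$omp& v11,v21,v12,v22, es11,es21,es12,es22, p11,p21,p12,p22, &
!$omp& te11,te21,te12,te22, emu11,emu21,emu12,emu22, &
!$omp& emumax, tauvij)
do i=i1l+2,i2l-2,2
do j=j1l+2,j2l-2,2
! ... same body as in the question: load the four corner values,
! compute emumax and tauvij ...
if (tauvij .le. tauv) tauv = tauvij
end do
end do
!$omp end parallel do
The read-only inputs (ro, u, v, es and the mesh bounds and constants) stay shared; only the per-point temporaries and the reduced minimum need special treatment.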

Thread segmentation fault when calling a function in a loop with OpenMP

I'm trying to use OpenMP in Fortran 90 to parallelize a do loop with a function call inside it. The first code listed below runs fine; the second one does not: I receive a segmentation fault.
First program: $ gfortran -O3 -o output -fopenmp OMP10.f90
program OMP10
!$ use omp_lib
IMPLICIT NONE
integer, parameter :: n = 100000
integer :: i
real(kind = 8) :: sum,h,x(0:n),f(0:n),ZBQLU01
!$ call OMP_set_num_threads(4)
h = 2.d0/dble(n)
!$OMP PARALLEL DO PRIVATE(i)
do i = 0,n
x(i) = -1.d0+dble(i)*h
f(i) = 2.d0*x(i)
end do
!$OMP END PARALLEL DO
sum = 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:SUM)
do i = 0,n-1
sum = sum + h*f(i)
end do
!$OMP END PARALLEL DO
write(*,*) "The integral is ", sum
end program OMP10
Second program: $ gfortran -O3 -o output -fopenmp randgen.f OMP10.f90
program OMP10
!$ use omp_lib
IMPLICIT NONE
integer, parameter :: n = 100000
integer :: i
real(kind = 8) :: sum,h,x(0:n),f(0:n),ZBQLU01
!$ call OMP_set_num_threads(4)
h = 2.d0/dble(n)
!$OMP PARALLEL DO PRIVATE(i)
do i = 0,n
x(i) = ZBQLU01(0.d0)
end do
!$OMP END PARALLEL DO
sum = 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:SUM)
do i = 0,n-1
sum = sum + h*f(i)
end do
!$OMP END PARALLEL DO
write(*,*) "The integral is ", sum
end program OMP10
In the above command, randgen.f is a library that contains the function ZBQLU01.
You cannot just call any function from a parallel region. The function must be thread safe. See What is meant by "thread-safe" code? and https://en.wikipedia.org/wiki/Thread_safety .
Your function is quite the opposite of thread safe, as is quite typical for random number generators. Just notice the SAVE statements in its source code, for many local variables and for a common block.
The solution is to use a good parallel random number generator. This site is not for software recommendations, but as a pointer, just search the web for "parallel prng" or "parallel random number generator". I personally use a library which I already pointed to in https://stackoverflow.com/a/38263032/721644 A simple web search reveals another simple possibility at https://jblevins.org/log/openmp . And then there are many larger and more complex libraries.
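As a minimal illustration of the per-thread-state idea (a sketch, not taken from any of the libraries above; the xorshift64 generator below is for demonstration only and has no serious statistical guarantees), each thread can seed and advance its own private state:
program per_thread_rng
!$ use omp_lib
implicit none
integer, parameter :: n = 100000
integer :: i
integer(kind = 8) :: s
real(kind = 8) :: x(0:n)
s = 88172645463325252_8
!$omp parallel firstprivate(s)
! give every thread a different, private seed
!$ s = s + 7919_8*omp_get_thread_num()
!$omp do
do i = 0,n
! one xorshift64 step per draw; all state is thread-local
s = ieor(s, ishft(s, 13))
s = ieor(s, ishft(s, -7))
s = ieor(s, ishft(s, 17))
x(i) = 0.5d0 + 0.5d0*dble(s)/dble(huge(s)) ! map to roughly (0,1)
end do
!$omp end do
!$omp end parallel
write(*,*) 'sample values:', x(0), x(n/2), x(n)
end program per_thread_rng
Real applications should prefer one of the parallel PRNG libraries mentioned above; the point here is only that each thread owns its entire generator state, so no SAVEd variables are shared between threads.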

How to use private variables' values outside the parallel region

I have a parallel Fortran code with OpenMP. This is the parallel part of the code:
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iconf,d,randx, &
!$OMP& randy,randz,unit_cg,theta1,theta2,theta3,fr1,fr2,fr3,Q, &
!$OMP& plproduct_avg,correlation_term)
allocate(randx(num),randy(num),randz(num),unit_cg(num), &
& fr1(num),fr2(num),fr3(num),theta1(order,num), &
& theta2(order,num),theta3(order,num), &
& Q(nfftdim1,nfftdim2,nfftdim3))
!$OMP DO
do 250 iconf = 1,1600
write(6,*)'configuration number',iconf
{some calculations}
do 350 d = 0,int(nfft1/2)
write(6,*)'correlation term of iconf =',iconf,'d=',d, &
& correlation_term(d,iconf)
350 continue
250 continue
!$OMP END DO
deallocate(randx,randy,randz,unit_cg,fr1,fr2,fr3,theta1,theta2, &
& theta3,Q)
!$OMP END PARALLEL
As you can see, this loop calculates correlation_term(d,iconf) (which is PRIVATE) and prints the values to the output file correctly. But when I use these variables outside the parallel region, all of them are zero.
How can I use the values of correlation_term calculated inside the parallel region after the region ends?
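One hedged sketch of the usual fix: each iteration writes only its own column, correlation_term(:,iconf), so no two threads ever write the same element. The array can therefore simply be left SHARED (removed from the PRIVATE list), and a shared array keeps its values after the parallel region ends. A self-contained analog:
program shared_result
implicit none
integer :: iconf, d
real(kind = 8) :: correlation_term(0:10, 1600) ! the shared result array
!$omp parallel default(shared) private(iconf, d)
!$omp do
do iconf = 1, 1600
do d = 0, 10
! each iteration writes only column iconf: no write conflict
correlation_term(d, iconf) = dble(d*iconf)
end do
end do
!$omp end do
!$omp end parallel
! the values are still available outside the parallel region
write(*,*) correlation_term(5, 100) ! prints 500.0
end program shared_result
Applied to the original code, this means dropping correlation_term from the PRIVATE clause and reading it after !$OMP END PARALLEL.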

What is the best way to reduce an array of arrays using OpenMP?

I am using OpenMP with Fortran and have boiled my use case down to a very simple example: I have an array of objects of a custom derived type, where each object contains an array of a different size. I want to make sure that, whatever happens in the loop, a reduction is applied to all of the values array components of the vector objects:
program main
implicit none
integer :: i
type vector
real,allocatable :: values(:)
end type vector
type(vector) :: vectors(3)
allocate(vectors(1)%values(3))
vectors(1)%values = 0
allocate(vectors(2)%values(6))
vectors(2)%values = 0
allocate(vectors(3)%values(9))
vectors(3)%values = 0
!$OMP PARALLEL REDUCTION(+:vectors%values)
!$OMP DO
do i=1,1000
vectors(1)%values = vectors(1)%values + 1
vectors(2)%values = vectors(2)%values + 2
vectors(3)%values = vectors(3)%values + 3
end do
!$OMP END DO
!$OMP END PARALLEL
print*,sum(vectors(1)%values)
print*,sum(vectors(2)%values)
print*,sum(vectors(3)%values)
end program main
In this case, REDUCTION(+:vectors%values) doesn't work because I get the following errors:
test2.f90(22): error #6159: A component cannot be an array if the encompassing structure is an array. [VALUES]
!$OMP PARALLEL REDUCTION(+:vectors%values)
-------------------------------------^
test2.f90(22): error #7656: Subobjects are not allowed in this OpenMP* clause; a named variable must be specified. [VECTORS]
!$OMP PARALLEL REDUCTION(+:vectors%values)
-----------------------------^
compilation aborted for test2.f90 (code 1)
I tried overloading the meaning of + for the vector type and then specifying REDUCTION(+:vectors), but then I still get:
test.f90(43): error #7621: The data type of the variable is not defined for the operator or intrinsic specified on the OpenMP* REDUCTION clause. [VECTORS]
!$OMP PARALLEL REDUCTION(+:vectors)
-----------------------------^
What is the recommended way to deal with derived types such as these and get the reduction to work?
Just for reference, the correct output when compiling without OpenMP is
3000.000
12000.00
27000.00
This is not just an OpenMP problem: you cannot reference vectors%values as one entity when values is an allocatable array component, because the rules of Fortran 2003 forbid it. Such an array would have no regular stride in memory; the allocatable components are stored at unrelated addresses.
If the number of elements of the encompassing array is small, you can do
!$OMP PARALLEL REDUCTION(+:vectors(1)%values,vectors(2)%values,vectors(3)%values)
!$OMP DO
do i=1,1000
vectors(1)%values = vectors(1)%values + 1
vectors(2)%values = vectors(2)%values + 2
vectors(3)%values = vectors(3)%values + 3
end do
!$OMP END DO
!$OMP END PARALLEL
otherwise you must make another loop, say over j, and reduce just vectors(j)%values.
If the compiler does not accept structure components in the reduction clause (one would have to study the latest standard to see whether this restriction has been relaxed), you can use a workaround:
!$OMP PARALLEL
do j = 1, size(vectors)
call aux(vectors(j)%values)
end do
!$OMP END PARALLEL
contains
subroutine aux(v)
real :: v(:)
!$OMP DO REDUCTION(+:v)
do i=1,1000
v = v + j
end do
!$OMP END DO
end subroutine
Associate or pointers would be simpler, but they are not allowed either.
As an alternative to Vladimir's answer, you can always implement your own reduction using a temporary array and a critical section:
program main
implicit none
integer :: i
type vector
real,allocatable :: values(:)
end type vector
type(vector) :: vectors(3)
type(vector),allocatable :: tmp(:)
allocate(vectors(1)%values(3))
vectors(1)%values = 0
allocate(vectors(2)%values(6))
vectors(2)%values = 0
allocate(vectors(3)%values(9))
vectors(3)%values = 0
!$OMP PARALLEL PRIVATE(TMP)
! Use a temporary array to hold each thread's partial sum;
! it must start at zero, otherwise the initial contents of
! vectors would be added once per thread
allocate( tmp(size(vectors)) )
do i=1,size(tmp)
allocate( tmp(i)%values( size(vectors(i)%values) ) )
tmp(i)%values = 0
enddo ! i
!$OMP DO
do i=1,1000
tmp(1)%values = tmp(1)%values + 1
tmp(2)%values = tmp(2)%values + 2
tmp(3)%values = tmp(3)%values + 3
end do
!$OMP END DO
! Get the global sum one thread at a time
!$OMP CRITICAL
vectors(1)%values = vectors(1)%values + tmp(1)%values
vectors(2)%values = vectors(2)%values + tmp(2)%values
vectors(3)%values = vectors(3)%values + tmp(3)%values
!$OMP END CRITICAL
deallocate(tmp)
!$OMP END PARALLEL
print*,sum(vectors(1)%values)
print*,sum(vectors(2)%values)
print*,sum(vectors(3)%values)
end program main
This snippet could be arranged more efficiently with a loop over all the elements of vectors; then tmp could be a scalar, as sketched below.
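A hedged sketch of that rearrangement, assuming an extra integer j and a thread-private buffer declared as real, allocatable :: tmp1(:) alongside the other declarations (tmp1 is a hypothetical name):
! loop over the elements of vectors; tmp1 holds one partial sum
!$OMP PARALLEL PRIVATE(tmp1, j)
do j = 1, size(vectors)
allocate( tmp1(size(vectors(j)%values)) )
tmp1 = 0 ! the partial sum starts at zero
!$OMP DO
do i = 1, 1000
tmp1 = tmp1 + j ! same increments as before: +1, +2, +3
end do
!$OMP END DO
!$OMP CRITICAL
vectors(j)%values = vectors(j)%values + tmp1
!$OMP END CRITICAL
deallocate(tmp1)
end do
!$OMP END PARALLEL
Each thread accumulates its share of the i iterations for one j at a time, and the critical section combines the partial sums, exactly as in the version above.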

Understanding the correct use of !$omp parallel do reduction(...)

I am trying to write a program in Fortran 90 that counts the number of primes between 1 and some number n, using OpenMP. The nested loop just counts the numbers that are not prime. Since I am only counting numbers that are not prime, I understand it should be enough to use something like !$omp parallel do reduction(+:not_primes). When I run the code below in serial, without the !$omp lines, I get the following output:
Primes: 5134
OpenMP time elapsed 0.49368596076965332
but when I include the !$omp lines I get
Primes: -1606400834
OpenMP time elapsed 0.37933206558227539
Have I used the parallel do correctly here? (apparently not, but why?) Thanks!
program prime_counter
integer n, not_primes, i, j
real*8 :: ostart,oend, omp_get_wtime
ostart = omp_get_wtime()
n=50000
!$omp parallel do reduction(+:not_primes)
do i=2,n
do j=2,i-1
if(mod(i,j)==0) then
not_primes= not_primes+1
exit
end if
end do
end do
!$omp end parallel do
print*, 'Primes:', n-not_primes
oend = omp_get_wtime()
write(*,*) 'OpenMP time elapsed', oend-ostart
end program
You do not initialize not_primes anywhere, so its value is undefined; that is the bug. The usage of the OpenMP reduction itself is OK. The inner index j is a sequential loop variable inside the parallel construct, and Fortran makes such variables private automatically, so marking it is not strictly necessary; I normally mark all indexes as private anyway:
not_primes = 0
!$omp parallel do reduction(+:not_primes) private(i,j)
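For completeness, a minimal corrected version of the whole program might look like this (a sketch; it pulls omp_get_wtime from omp_lib instead of declaring it by hand):
program prime_counter
use omp_lib
implicit none
integer :: n, not_primes, i, j
real(kind = 8) :: ostart, oend
ostart = omp_get_wtime()
n = 50000
not_primes = 0 ! the reduction variable must start defined
!$omp parallel do reduction(+:not_primes) private(i,j)
do i=2,n
do j=2,i-1
if(mod(i,j)==0) then
not_primes = not_primes+1
exit
end if
end do
end do
!$omp end parallel do
print*, 'Primes:', n-not_primes
oend = omp_get_wtime()
write(*,*) 'OpenMP time elapsed', oend-ostart
end program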