Slowdown of a parallel subroutine when called inside a loop - Fortran

I have parallelized a subroutine. It benchmarks very well: a 4x speedup on a quad core. I keep the two versions in two different sources: serial.f and paral.f. The comparison is made by running them from the terminal and printing the elapsed wall-clock time. Inside each source there is only a call to the associated subroutine. But when I modify the sources like this:
serial.f :
do i=1,100
call serial
end do
and like this
paral.f :
do i=1,100
call paral
end do
performance drops to a 0.96x speedup: the parallel version is worse than the serial one! The code can be found in "why calling many N times a serial subroutine is faster than calling N times the parallel version of the same subroutine".
To obtain serial.f, just comment out the block containing the call to paral; to obtain paral.f, just comment out the block containing the call to serial.
My questions: is this a common problem? How can I solve it so as to keep the 4x speedup while keeping the call inside a loop?
Please note:
(1) I've tried translating the code to C and timing it; the benchmarks and the problem stay the same.
(2) I've tried translating it to modern Fortran and timing it; the benchmarks and the problem stay the same.
(3) I've tried all kinds of tricks and rewrites of the code. I'm sure the problem is not how the subroutine is parallelized (I do achieve the 4x speedup), but that it is called too many times inside a loop.
Thank you.
EDIT:
As requested, I'm posting a program written in modern Fortran that exhibits the same issue:
program main
  use omp_lib
  implicit none
  integer ( kind = 4 ), parameter :: m = 5000
  integer ( kind = 4 ), parameter :: n = 5000
  integer ( kind = 4 ) i
  integer ( kind = 4 ) j
  integer ( kind = 4 ) nn
  real ( kind = 8 ) u(m,n)
  real ( kind = 8 ) w(m,n)
  real ( kind = 8 ) wtime, h

  call random_seed()
  do j = 1, n
    do i = 1, m
      call random_number(u(i,j))
    end do
  end do

  wtime = omp_get_wtime ( )
  do nn = 1, 100
    !$omp parallel do default(none) shared(u, w) private(i,j)
    do j = 2, n - 1
      do i = 2, m - 1
        w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
      end do
    end do
    !$omp end parallel do
  end do
  wtime = omp_get_wtime ( ) - wtime

  h = 0.0D+00
  do j = 1, n
    do i = 1, m
      h = h + w(i,j)
    end do
  end do

  write ( *, '(a,g14.6)' ) ' Wall clock time serial= ', wtime
  write ( *, '(a,g14.6)' ) ' h ', h

  stop
end
To get serial_with_loop.f90, just comment out the OpenMP directives; in a similar way you can obtain parall_with_loop.f90, and the serial and parallel versions without the loop (by commenting out the nn loop). You can compile with "gfortran -o name.out -fopenmp -O3 name.f90" and launch from the terminal with the output redirected to a text file: "name.out > time_result.txt".

The problem is that you are parallelizing the loop over j, which sits inside the loop over nn. Therefore, for each value of nn, your machine needs time to set up the team of threads that does the work for the different values of j, and this time (required for creating the team) is serial and cannot be divided by the number of threads used. As far as I can see from your code, there is no reason why you could not parallelize the nn loop instead and create that team only once, rather than nn times. I think your code will work better if you write
wtime = omp_get_wtime ( )
!$omp parallel do default(none) shared(u, w) private(nn,i,j)
do nn = 1, 100
  do j = 2, n - 1
    do i = 2, m - 1
      w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
    end do
  end do
end do
!$omp end parallel do
wtime = omp_get_wtime ( ) - wtime
I hope that this helps you.
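A sketch of an alternative, in case the nn iterations of a real application need to stay in order (an assumption, not something the benchmark above requires): open the parallel region once outside the nn loop and put only a worksharing !$omp do inside it. The team is then set up a single time, and the implicit barrier at !$omp end do still separates successive nn iterations:

wtime = omp_get_wtime ( )
!$omp parallel default(none) shared(u, w) private(nn, i, j)
do nn = 1, 100
  !$omp do
  do j = 2, n - 1
    do i = 2, m - 1
      w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
    end do
  end do
  !$omp end do
end do
!$omp end parallel
wtime = omp_get_wtime ( ) - wtime

Every thread executes the nn loop redundantly, but the j iterations are still shared among the team, so the per-iteration work is the same as in the original version while the region-entry cost is paid only once.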

Related

Array access in async OpenACC kernels

Say I have a Fortran program that performs two tasks on an array: task A computes its mean and task B doubles it. The point is that task B should be independent from task A. When accelerating the program with OpenACC, it would make sense to run the two tasks concurrently by making task A asynchronous:
program test
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real(8) :: mean
  real(8) :: array(n)
  real(8) :: array_d(n)

  ! initialize array
  array = [(i, i=1, n)]

  !$acc kernels async num_gangs(1)
  ! Task A: get mean of array
  mean = 0d0
  !$acc loop independent reduction(+:mean)
  do i = 1, n
    mean = mean + array(i)
  end do
  mean = mean / n
  !$acc end kernels

  !$acc kernels
  ! Task B: work on array
  !$acc loop independent
  do i = 1, n
    array(i) = array(i) * 2
  end do
  !$acc end kernels

  !$acc wait

  ! print array and mean
  print "(10(g0.2, x))", array(:10)
  print "('mean = ', g0.2)", mean
end program
However, when running the two tasks at the same time, task B will modify the array that task A is reading, leading to incorrect values. On CPU (no acceleration) I get:
2.0 4.0 6.0 8.0 10. 12. 14. 16. 18. 20.
mean = 500000.5000000000
On GPU (using the NVIDIA HPC SDK), I get a different mean which is obviously incorrect:
2.0 4.0 6.0 8.0 10. 12. 14. 16. 18. 20.
mean = 999967.6836640000
Is there an elegant way to "protect" the array that task A is reading?
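One possible direction, sketched under assumptions rather than taken from the original post: since array_d is already declared but unused, task B could write its doubled values into array_d and leave array untouched, so task A's reduction never races with a writer. The distinct queue ids async(1) and async(2) below are illustrative; whether the two kernels really overlap still depends on the implementation and on data placement.

!$acc kernels async(1) num_gangs(1)
! Task A: get the mean of the read-only input array
mean = 0d0
!$acc loop independent reduction(+:mean)
do i = 1, n
  mean = mean + array(i)
end do
mean = mean / n
!$acc end kernels

!$acc kernels async(2)
! Task B: write the doubled values into the separate array_d,
! leaving array untouched for task A
!$acc loop independent
do i = 1, n
  array_d(i) = array(i) * 2
end do
!$acc end kernels

!$acc wait
print "(10(g0.2, x))", array_d(:10)
print "('mean = ', g0.2)", mean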

Parallelization of Intel Fortran code for allocatable type containing pointer arrays using OpenMP

I am trying to parallelize a part of my code containing a nested do loop. There is a READ operation within the nested loop. I am trying to use OpenMP to reduce the wall time of the computation.
I have a type that contains a pointer array. I'm not sure how to handle the error message "attempt to use pointer CellArr when it is not associated with a target" that I get when I try to use
P_ph(iph,iel)%cellArr(igp)%arr outside this nested loop.
OPEN (24, FILE=TRIM(ADJUSTL(InpFile))//"_GSP.dat", STATUS='OLD', ACTION='READ', &
      ACCESS='DIRECT', FORM='FORMATTED', RECL=600*nel)
!$omp parallel private(id,iel,iph,igp,igp1,tmpP,lineNo,isegel,iSegEls,ngp,ngp1,P_ph, &
!$omp&                 isegelTmp, igpTmp, phPolVec, integrand) shared(ShapeFunc_P, ElConn, nph)
id = omp_get_thread_num()
!$omp do
DO iel = 1, nel
  ngp = elConn(iel)%ngp
  DO iph = 1, nph
    ALLOCATE( P_ph(iph, iel)%cellArr(ngp) )
  END DO
  DO igp = 1, ngp
    lineNo = SUM( elConn(1:iel-1)%ngp ) + igp
    READ(24, FMT=101, REC=lineNo) isegelTmp, igpTmp, phPolVec
    DO iph = 1, nph
      ALLOCATE( P_ph(iph,iel)%cellArr(igp)%arr(ndim,ndim) )
      tmpP = 0.d0
      DO isegels = 1, Seg_P(iph)%segSize
        isegel = Seg_P(iph)%els(isegels)
        ngp1 = elConn(isegel)%ngp
        ALLOCATE( integrand(ngp1) )
        ! Retrieve the PhP function from the .dat file
        phP = RESHAPE( SOURCE = phPolVec((isegel-1)*ndim*ndim+1:isegel*ndim*ndim), &
                       SHAPE=(/ndim,ndim/) ) / elConn(isegel)%vol
        DO igp1 = 1, ngp1
          ALLOCATE( integrand(igp1)%arr(ndim,ndim) )
          integrand(igp1)%arr = phP*ShapeFunc_P(isegel)
        END DO
        CALL INTEGRAL( tmpP, integrand, elConn(isegel)%jacobian, ngp, nsd, ndim )
        DO igp1 = 1, ngp1
          DEALLOCATE( integrand(igp1)%arr )
        END DO
        DEALLOCATE( integrand )
      END DO
      P_ph(iph,iel)%cellArr(igp)%arr = tmpP
    END DO
  END DO
END DO
!$omp end do
!$omp end parallel
CLOSE (24)
The types are as follows:
TYPE CELL
REAL*8, POINTER :: arr(:,:)
END TYPE CELL
TYPE CELL2
TYPE (CELL), POINTER :: CellArr(:)
END TYPE CELL2
TYPE (CELL2) :: P_ph(nph, nel)
This code works fine as a sequential program.
Does it make sense that any thread could act on an arbitrary record in the file on unit 24?
If it does, it would probably be better to place the READ in a !$OMP CRITICAL region.
I also notice that the file is FORMATTED and DIRECT, with access via REC=lineNo and RECL=600*nel. That is an unusual record size; is this actually running? The file size is 600*nel * sum(elConn(1:nel)%ngp), which looks very big, of order nel^2.
It may be better to create this information as a shared derived-type array of (isegelTmp, igpTmp, phPolVec) before entering the !$OMP region and then process it "randomly" from any thread. (There is no indication of the type or size of these three components.)
What is the record id SUM( elConn(1:iel-1)%ngp ) + igp? Does it vary while processing (probably not)? Perhaps it would also be better to build a shared index of the first record of each "iel" before entering the !$OMP region and use it to define the work for each "iel" done by any thread.
Where is all the information for each "iel"? On another shared direct-access file?
I have not answered the question of whether reading a direct-access file randomly from multiple threads is thread safe. I have not tried this, but !$OMP CRITICAL would be a minimum (see the sketch below); you could try a test (expect lots of disk-buffer clashes). It is much safer to create a shared in-memory data structure first. Hopefully each iel's processing time is much longer than the reading time.
Where do the processed results go? To the same direct-access file? That just kicks the problem down the road.
This looks like a result-processing loop. In my analysis I have not moved this to !$OMP, as result processing tends to be much quicker than the results-calculation phase. With 64-bit, I have certainly moved the generated results to memory rather than processing them from disk.
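As a hedged illustration of the !$OMP CRITICAL suggestion, using the names from the posted loop (the critical-section name read_gsp is arbitrary); this only serializes the READ itself and does not address the other points above:

! only one thread at a time executes the direct-access READ
!$OMP CRITICAL (read_gsp)
READ(24, FMT=101, REC=lineNo) isegelTmp, igpTmp, phPolVec
!$OMP END CRITICAL (read_gsp)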

Openmp thread distribution

I am trying to learn OpenMP (particularly !$omp do) in Fortran 90 and I am facing a strange problem. I have written a simple code and am executing it with 2 threads. I have compiled the program with "gfortran -fopenmp filename.f90".
use omp_lib
implicit none
integer :: i,j,a(2)
write ( *, '(a,i8)' ) &
' The number of processors available = ', omp_get_num_procs ( )
write ( *, '(a,i8)' ) &
' The number of threads available = ', omp_get_max_threads ( )
!$OMP DO
do i = 1, 2
a(i) = i + 1
j = OMP_GET_THREAD_NUM()
print*,"a(i)=",a(i), "j = ",OMP_GET_THREAD_NUM()
enddo
!$OMP END DO
end
The output I am seeing is
The number of processors available = 6
The number of threads available = 2
a(i)= 2 j = 0
a(i)= 3 j = 0
But I was expecting the variable j to take the values 0 and 1 for i = 1 and i = 2, respectively. Where am I going wrong? My expectation is met if I use !$omp parallel do instead of !$omp do.
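For reference, this behaviour is expected: a bare !$OMP DO is a worksharing directive, and without an enclosing parallel region there is only the initial thread to share the work with, so every iteration reports thread 0. A minimal sketch of the combined form, essentially what the question already found to work (j is made private here to avoid a race on it):

! the combined directive both creates the team and shares the loop iterations
!$OMP PARALLEL DO PRIVATE(j)
do i = 1, 2
  a(i) = i + 1
  j = OMP_GET_THREAD_NUM()
  print *, "a(i)=", a(i), "j = ", j
end do
!$OMP END PARALLEL DO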

Thread Segmentation fault when calling function in loop with OpenMP

I'm trying to use OpenMP in Fortran 90 to parallelize a do loop with a function call inside. The first code listed runs fine. The second does not: I receive a segmentation fault.
First program: $ gfortran -O3 -o output -fopenmp OMP10.f90
program OMP10
!$ use omp_lib
IMPLICIT NONE
integer, parameter :: n = 100000
integer :: i
real(kind = 8) :: sum,h,x(0:n),f(0:n),ZBQLU01
!$ call OMP_set_num_threads(4)
h = 2.d0/dble(n)
!$OMP PARALLEL DO PRIVATE(i)
do i = 0,n
x(i) = -1.d0+dble(i)*h
f(i) = 2.d0*x(i)
end do
!$OMP END PARALLEL DO
sum = 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:SUM)
do i = 0,n-1
sum = sum + h*f(i)
end do
!$OMP END PARALLEL DO
write(*,*) "The integral is ", sum
end program OMP10
Second program: $ gfortran -O3 -o output -fopenmp randgen.f OMP10.f90
program OMP10
!$ use omp_lib
IMPLICIT NONE
integer, parameter :: n = 100000
integer :: i
real(kind = 8) :: sum,h,x(0:n),f(0:n),ZBQLU01
!$ call OMP_set_num_threads(4)
h = 2.d0/dble(n)
!$OMP PARALLEL DO PRIVATE(i)
do i = 0,n
x(i) = ZBQLU01(0.d0)
end do
!$OMP END PARALLEL DO
sum = 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:SUM)
do i = 0,n-1
sum = sum + h*f(i)
end do
!$OMP END PARALLEL DO
write(*,*) "The integral is ", sum
end program OMP10
In the above command, randgen.f is a library that contains the function ZBQLU01.
You cannot just call any function from a parallel region. The function must be thread safe. See What is meant by "thread-safe" code? and https://en.wikipedia.org/wiki/Thread_safety .
Your function is the very opposite of thread safe, as is quite typical for random number generators. Just notice the SAVE statements in its source code, applied to many local variables and to a common block.
The solution is to use a good parallel random number generator. This site is not for software recommendations, but as a pointer, just search the web for "parallel PRNG" or "parallel random number generator". I personally use a library I already pointed to in https://stackoverflow.com/a/38263032/721644. A simple web search reveals another simple possibility at https://jblevins.org/log/openmp. And then there are many larger and more complex libraries.
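A minimal sketch of the idea, not using ZBQLU01 or any particular library: give each thread its own generator state, so nothing SAVE'd or in a common block is shared between threads. The program name, the xorshift64 update and the seeding constants below are illustrative assumptions only, not a recommendation of a specific generator.

program per_thread_rng
  use omp_lib
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real(kind=8) :: x(0:n)
  integer(kind=8) :: state

  !$omp parallel private(state)
  ! each thread seeds its own state (any scheme giving distinct, nonzero seeds works)
  state = 1234567891011_8 + 999983_8 * omp_get_thread_num()
  !$omp do
  do i = 0, n
    x(i) = next_uniform(state)
  end do
  !$omp end do
  !$omp end parallel

  write(*,*) "first values: ", x(0:4)

contains

  ! one xorshift64 step followed by a mapping to [0,1); the state is private per thread
  function next_uniform(s) result(r)
    integer(kind=8), intent(inout) :: s
    real(kind=8) :: r
    s = ieor(s, ishft(s, 13))
    s = ieor(s, ishft(s, -7))
    s = ieor(s, ishft(s, 17))
    r = 0.5d0 + dble(s) * 2.0d0**(-64)
  end function next_uniform
end program per_thread_rng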

What is the best way to reduce an array of arrays using OpenMP?

I am using OpenMP with Fortran and have boiled my use case down to a very simple example: an array of objects of a custom derived type, where each object contains an array of a different size. I want to make sure that, whatever happens in the loop, a reduction is applied to all the values array components of the vector objects:
program main
implicit none
integer :: i
type vector
real,allocatable :: values(:)
end type vector
type(vector) :: vectors(3)
allocate(vectors(1)%values(3))
vectors(1)%values = 0
allocate(vectors(2)%values(6))
vectors(2)%values = 0
allocate(vectors(3)%values(9))
vectors(3)%values = 0
!$OMP PARALLEL REDUCTION(+:vectors%values)
!$OMP DO
do i=1,1000
vectors(1)%values = vectors(1)%values + 1
vectors(2)%values = vectors(2)%values + 2
vectors(3)%values = vectors(3)%values + 3
end do
!$OMP END DO
!$OMP END PARALLEL
print*,sum(vectors(1)%values)
print*,sum(vectors(2)%values)
print*,sum(vectors(3)%values)
end program main
In this case, REDUCTION(+:vectors%values) doesn't work because I get the following errors:
test2.f90(22): error #6159: A component cannot be an array if the encompassing structure is an array. [VALUES]
!$OMP PARALLEL REDUCTION(+:vectors%values)
-------------------------------------^
test2.f90(22): error #7656: Subobjects are not allowed in this OpenMP* clause; a named variable must be specified. [VECTORS]
!$OMP PARALLEL REDUCTION(+:vectors%values)
-----------------------------^
compilation aborted for test2.f90 (code 1)
I tried overloading the meaning of + for the vector type and then specifying REDUCTION(+:vectors), but then I still get:
test.f90(43): error #7621: The data type of the variable is not defined for the operator or intrinsic specified on the OpenMP* REDUCTION clause. [VECTORS]
!$OMP PARALLEL REDUCTION(+:vectors)
-----------------------------^
What is the recommended way to deal with derived types such as these and get the reduction to work?
Just for reference, the correct output when compiling without OpenMP is
3000.000
12000.00
27000.00
This is not just an OpenMP problem: you cannot reference vectors%values as one entity when values is an allocatable array component, because the rules of Fortran 2003 forbid it. The reason is that such an array would have no regular stride in memory; the allocatable components are stored at unrelated addresses.
If the number of elements of the encompassing array is small you can do
!$OMP PARALLEL REDUCTION(+:vectors(1)%values,vectors(2)%values,vectors(3)%values)
!$OMP DO
do i=1,1000
vectors(1)%values = vectors(1)%values + 1
vectors(2)%values = vectors(2)%values + 2
vectors(3)%values = vectors(3)%values + 3
end do
!$OMP END DO
!$OMP END PARALLEL
otherwise you must add another loop, say over j, and reduce just vectors(j)%values.
If the compiler does not accept structure components in the reduction clause (one would have to study the latest standard to see whether this restriction has been relaxed), you can use a workaround:
!$OMP PARALLEL
do j = 1, size(vectors)
call aux(vectors(j)%values)
end do
!$OMP END PARALLEL
contains
subroutine aux(v)
real :: v(:)
!$OMP DO REDUCTION(+:v)
do i=1,1000
v = v + j
end do
!$OMP END DO
end subroutine
Associate or pointers would be simpler, but they are not allowed either.
As an alternative to Vladimir's answer, you can always implement your own reduction using a temporary array and a critical section:
program main
implicit none
integer :: i
type vector
real,allocatable :: values(:)
end type vector
type(vector) :: vectors(3)
type(vector),allocatable :: tmp(:)
allocate(vectors(1)%values(3))
vectors(1)%values = 0
allocate(vectors(2)%values(6))
vectors(2)%values = 0
allocate(vectors(3)%values(9))
vectors(3)%values = 0
!$OMP PARALLEL PRIVATE(TMP)
! Use a temporary array to hold the local sum
allocate( tmp(size(vectors)) )
do i=1,size(tmp)
allocate( tmp(i)%values( size(vectors(i)%values )) )
tmp(i)%values = vectors(i)%values
enddo ! i
!$OMP DO
do i=1,1000
tmp(1)%values = tmp(1)%values + 1
tmp(2)%values = tmp(2)%values + 2
tmp(3)%values = tmp(3)%values + 3
end do
!$OMP END DO
! Get the global sum one thread at a time
!$OMP CRITICAL
vectors(1)%values = vectors(1)%values + tmp(1)%values
vectors(2)%values = vectors(2)%values + tmp(2)%values
vectors(3)%values = vectors(3)%values + tmp(3)%values
!$OMP END CRITICAL
deallocate(tmp)
!$OMP END PARALLEL
print*,sum(vectors(1)%values)
print*,sum(vectors(2)%values)
print*,sum(vectors(3)%values)
end program main
This snippet could be arranged more efficiently with a loop over all elements of vectors; then tmp could be a scalar, as sketched below.
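A hedged sketch of that rearrangement, assuming an extra integer j and a type(vector) scalar tmp_one are declared in the program above (the increment added to each element happens to equal its index j):

!$OMP PARALLEL PRIVATE(tmp_one, j)
do j = 1, size(vectors)
  ! tmp_one holds this thread's partial sum for vectors(j) only
  allocate( tmp_one%values(size(vectors(j)%values)) )
  tmp_one%values = 0
  !$OMP DO
  do i = 1, 1000
    tmp_one%values = tmp_one%values + j
  end do
  !$OMP END DO
  ! accumulate the partial sums one thread at a time
  !$OMP CRITICAL
  vectors(j)%values = vectors(j)%values + tmp_one%values
  !$OMP END CRITICAL
  deallocate(tmp_one%values)
end do
!$OMP END PARALLEL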