Passing 3d arrays using MPI_Bcast - fortran

I'm trying to pass a 3D array to all other processes (in FORTRAN 77) using MPI_Bcast. v1 is a common-block array. I'm also not sure whether I need to broadcast the calculated values of the common array v1 to all other processes, or whether they will be updated in each process automatically because the array is in a common block. The following is the relevant piece of code:
parameter (nprocz=48,nzro=1)
do i=i101,i102
dist = 0.015*float(i-iv0)
adamp = exp(-dist*dist)
do j = je0, je1-1
do k = ke0, ke1
v1(k,j,i) = v1(k,j,i)*adamp
end do
end do
end do
nmpi01=floor((iv0-ie0-nzro)/(nprocz-1))
if (mpirank .le. nprocz-2) then
i101=ie0+(mpirank*nmpi01)
i102=ie0+(mpirank+1)*nmpi01-1
else
i101=ie0+(mpirank*nmpi01)
i102=iv0-1
endif
MPI_Bcast(v1(:,:,i101:i102),(ke1-ke0+1)*(je1-je0)*(i102-i101+1)
& ,MPI_FLOAT,mpirank,MPI_COMM_WORLD,ierr01)
I get the error message:
PGFTN-S-0081-Matrix/vector v1 illegal as subprogram argument
The sizes of the arrays being passed in are correct. Any comment?
I corrected the code: I loop over the ranks and compute all elements of rcount and displs in every rank:
integer :: myscount, myi101
do rank = 0, nprocz-1
nmpi01=floor((iv0-ie0-nzro)/(nprocz-1))
if (rank .le. nprocz-2) then
i101=ie0+(rank*nmpi01)
i102=ie0+(rank+1)*nmpi01-1
else
i101=ie0+(rank*nmpi01)
i102=iv0-1
endif
scount=(i102-i101+1)*(je1-je0)*(ke1-ke0+1)
rcount(rank+1)=scount
displs(rank+1)=rank*scount+1
if (rank .eq. mpirank) then
myscount = scount
myi101 = i101
end if
end do
scount = myscount
i101 = myi101
call mpi_allgatherv(...)
But I still get wrong results.
1. In my case, the results of each part are used for the next part, especially after mpi_allgatherv. Do I need to add mpi_barrier after each mpi_allgatherv?
2. Should mpi_in_place be used? Consider that I have only one 3D array v1, where each sub-array v1(1,1,i) is calculated by some process, and I want to put the calculated sub-array into the appropriate part of the same array.
3. I guess I should have displs(i) = sum(rcount(1:i-1)) + 1 for i >= 2, considering that displs(1) = 1 always in Fortran 77. So I changed it to this: displs(1) = 1 before the loop, displs(rank+2) = rank*scount + 1 inside the loop, and displs(nprocz+1) = 0 after the loop. Am I right?

As I recall, Fortran 77 was more restrictive about array subscripts than Fortran 90, and pgftn is a Fortran 77 compiler. I would try passing v1(1,1,i101) to mpi_bcast, not v1(:,:,i101:i102). (Or use pgf95 with the "-Mfixed" flag.)
If each process needs to see v1, then you do need to communicate it using MPI. No variable is shared between MPI tasks, not even those in a common block. However, if every process is calculating a different part of v1, so every process needs a piece from every other process, you can't use mpi_bcast to do that; use mpi_allgather instead.
Also, MPI procedures are subroutines, so you have to invoke them with call; the MPI_Bcast line above is missing it.
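To make the gather step concrete, here is a hedged sketch (illustrative only, not the poster's code). It assumes mpif.h or the mpi module is available, that v1 is declared with extents exactly (ke1-ke0+1) by (je1-je0) in its first two dimensions, and that its third dimension starts at ie0, so consecutive i-slabs are contiguous and element displacement 0 falls on the first slab; rcount and displs are the usual mpi_allgatherv count and offset arrays, and MPI_REAL is used because MPI_FLOAT is the C-side name.
c     hedged sketch: gather every rank's i-slab of v1 into all ranks
      integer rank, nk, nj
      integer rcount(nprocz), displs(nprocz)
      nk = ke1 - ke0 + 1
      nj = je1 - je0
      do rank = 0, nprocz-1
         i101 = ie0 + rank*nmpi01
         if (rank .le. nprocz-2) then
            i102 = ie0 + (rank+1)*nmpi01 - 1
         else
            i102 = iv0 - 1
         end if
         rcount(rank+1) = nk*nj*(i102 - i101 + 1)
c        MPI displacements are zero-based element offsets into the
c        receive buffer, so the first one is 0, not 1
         if (rank .eq. 0) then
            displs(1) = 0
         else
            displs(rank+1) = displs(rank) + rcount(rank)
         end if
      end do
c     every rank already holds its own slab inside v1, so MPI_IN_PLACE
c     lets each rank contribute and receive through v1 itself
      call mpi_allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
     &                    v1, rcount, displs, MPI_REAL,
     &                    MPI_COMM_WORLD, ierr01)
Whether this matches your layout depends on the actual declared bounds of v1; if the i-slabs are not contiguous in memory, the counts and displacements above are not valid and a derived datatype (or packing into a contiguous buffer) is needed instead.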

Related

How to properly use OpenMP for this nested loop (Fortran)

I have recently started to parallelize a serial code I've been developing and was curious if anyone had input on how to properly apply OpenMP for these loops.
F_vol = 0.0_bp
G_vol = 0.0_bp
H_vol = 0.0_bp
RHS_vol = 0.0_bp
!$OMP PARALLEL DO PRIVATE(e,i1,j1,F,G,H)
DO e=1,NE
DO i1=1,Np
DO j1=1,NPTS
CALL Flux(Q(j1,e,:),F,G,H)
F_Vol(i1,e,:) = F_Vol(i1,e,:) + Stuff(j1)*F(:)
G_Vol(i1,e,:) = G_vol(i1,e,:) + Stuff(j1)*G(:)
H_Vol(i1,e,:) = H_vol(i1,e,:) + Stuff(j1)*H(:)
END DO
END DO
END DO
!$OMP END PARALLEL DO
As a note, the arrays F, G, and H are of size 5 and are temporary arrays. Additionally, F_Vol, G_Vol, and H_Vol are of dimension (NE,Np,5). The part I am unsure about is how to properly parallelize the arrays I sum over j1 = 1, NPTS. Given that they are not dependent on each other but vary with i1 and e, I think using PRIVATE() is required so as to avoid overwriting. Lastly, I am fixated on these loops because, according to GProf, a good portion of my computational expense is in this area of the code.
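A hedged sketch of one common arrangement (assuming Flux writes only its F, G and H arguments and that bp is the working real kind): keep the parallel loop over e and make the scratch arrays private. Because each thread owns distinct values of e, the updates to F_Vol(:,e,:), G_Vol(:,e,:) and H_Vol(:,e,:) never overlap, so no reduction clause is needed. Since the call to Flux does not involve i1, the sketch also hoists it out of the i1 loop.
! Sketch only: the per-thread scratch arrays are PRIVATE; each thread
! updates only its own e-slices, so the shared accumulators need no
! REDUCTION.  (Declarations shown here just for clarity.)
REAL(bp) :: F(5), G(5), H(5)
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(e,i1,j1,F,G,H)
DO e = 1, NE
  DO j1 = 1, NPTS
    CALL Flux(Q(j1,e,:), F, G, H)   ! Flux depends on (j1,e) only, so call it once per j1
    DO i1 = 1, Np
      F_Vol(i1,e,:) = F_Vol(i1,e,:) + Stuff(j1)*F(:)
      G_Vol(i1,e,:) = G_Vol(i1,e,:) + Stuff(j1)*G(:)
      H_Vol(i1,e,:) = H_Vol(i1,e,:) + Stuff(j1)*H(:)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO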

Parallelization of Intel Fortran code for allocatable type containing pointer arrays using OpenMP

I am trying to parallelize a part of the code that contains a nested do loop. There is a READ operation inside the nested loop, and I am trying to use OpenMP to reduce the wall time of the computation.
I have a type which contains an allocatable pointer. I'm not sure how to handle the error message "attempt to use pointer CellArr when it is not associated with a target" that I get when I try to use
P_ph(iph,iel)%cellArr(igp)%arr outside this nested loop.
OPEN (24, FILE=TRIM(ADJUSTL(InpFile))//"_GSP.dat", STATUS='OLD', ACTION='READ', &
ACCESS='DIRECT', FORM='FORMATTED', RECL=600*nel)
!$omp parallel private(id,iel,iph,igp,igp1,tmpP,lineNo,isegel,iSegEls,ngp,ngp1,P_ph, &
isegelTmp, igpTmp, phPolVec, integrand) shared(ShapeFunc_P, ElConn, nph)
id=omp_get_thread_num()
!$omp do
DO iel = 1, nel
ngp = elConn(iel)%ngp
DO iph = 1, nph
ALLOCATE( P_ph(iph, iel)%cellArr(ngp) )
END DO
DO igp = 1, ngp
lineNo = SUM( elConn(1:iel-1)%ngp ) + igp
READ(24,FMT=101,REC=lineNo ) isegelTmp, igpTmp, phPolVec
DO iph = 1, nph
ALLOCATE( P_ph(iph,iel)%cellArr(igp)%arr(ndim,ndim) )
tmpP = 0.d0
DO isegels = 1, Seg_P(iph)%segSize
isegel = Seg_P(iph)%els(isegels)
ngp1 = elConn(isegel)%ngp
ALLOCATE( integrand(ngp1) )
!Retrieve the PhP function from .dat file
phP = RESHAPE( SOURCE = phPolVec((isegel-1)*ndim*ndim+ &
1:isegel*ndim*ndim ),SHAPE=(/ndim,ndim/) ) / elConn(isegel)%vol
DO igp1 = 1, ngp1
ALLOCATE( integrand(igp1)%arr(ndim,ndim) )
integrand(igp1)%arr = phP*ShapeFunc_P(isegel)
END DO
CALL INTEGRAL( tmpP, integrand, elConn(isegel)%jacobian, ngp, nsd, ndim)
DO igp1=1, ngp1
DEALLOCATE( integrand(igp1)%arr )
END DO
DEALLOCATE(integrand)
END DO
P_ph(iph,iel)%cellArr(igp)%arr = tmpP
END DO
END DO
END DO
!$omp end do
!$omp end parallel
CLOSE (24)
The types are as follows:
TYPE CELL
REAL*8, POINTER :: arr(:,:)
END TYPE CELL
TYPE CELL2
TYPE (CELL), POINTER :: CellArr(:)
END TYPE CELL2
TYPE (CELL2) :: P_ph(nph, nel)
This code works fine as a sequential program.
Does it make sense for any thread to act on an arbitrary record of the file on unit 24?
If it does, it would probably be better to place the READ in a !$OMP CRITICAL region.
I also notice that the file is FORMATTED and DIRECT, with access via REC=lineNo and RECL=600*nel. (An unusual record size; is this actually running? The file size is 600*nel * sum(elConn(1:nel)%ngp), which looks very big, of order nel^2.)
It may be better to create this information as a shared derived-type array of (isegelTmp, igpTmp, phPolVec) before entering the !$OMP region and then process it "randomly" from any thread. (There is no indication of the type or size of these 3 components.)
What is the record id SUM( elConn(1:iel-1)%ngp ) + igp? Does it vary while processing (probably not)? It would probably be better to also build a shared index of the first record of each "iel" before entering the !$OMP region, and use it to define the work for each "iel" on any thread.
Where is all the information for each "iel"? On another shared direct-access file?
I have not answered the question of whether reading a direct-access file randomly from multiple threads is thread-safe. I have not tried it, but !$OMP CRITICAL would be a minimum; you could try a test (expect lots of disk-buffer clashes). It is much safer to create a shared in-memory data structure first. Hopefully each "iel" takes much longer to process than to read.
Where do the processed results go? To the same direct-access file? That would just kick the problem down the road.
This looks like a result-processing loop. In my own analysis I have not moved this kind of loop to !$OMP, as result processing tends to be much quicker than the results-calculation phase. With 64-bit addressing I have certainly moved the generated results to memory rather than processing them from disk.
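To make the "shared in-memory structure" suggestion concrete, here is a rough sketch (the types and sizes of the three record components are guesses, since they are not shown; recTot, rec, isegelRec, igpRec, phPolRec and lenPhp are names invented for the sketch):
INTEGER :: recTot, rec, lenPhp
INTEGER, ALLOCATABLE :: isegelRec(:), igpRec(:)
REAL*8,  ALLOCATABLE :: phPolRec(:,:)

recTot = SUM( elConn(1:nel)%ngp )
lenPhp = SIZE( phPolVec )
ALLOCATE( isegelRec(recTot), igpRec(recTot), phPolRec(lenPhp,recTot) )

! serial: the direct-access file is touched exactly once, before any threads exist
DO rec = 1, recTot
   READ(24, FMT=101, REC=rec) isegelRec(rec), igpRec(rec), phPolRec(:,rec)
END DO
CLOSE(24)

!$omp parallel do private(iel, igp, lineNo)
DO iel = 1, nel
   DO igp = 1, elConn(iel)%ngp
      lineNo = SUM( elConn(1:iel-1)%ngp ) + igp
      ! ... work with isegelRec(lineNo), igpRec(lineNo), phPolRec(:,lineNo)
      ! instead of the READ that was inside the original loop ...
   END DO
END DO
!$omp end parallel do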

The sum function returns answers different from an explicit loop

I am converting F77 code to F90 code, and part of the code needs to sum over the elements of a 3D array. In F77 this was accomplished with 3 nested loops (over the outer, middle, and inner indices). I decided to use the F90 intrinsic SUM (three times) to accomplish this, and much to my surprise the answers differ. I am using the ifort compiler, with debugging, bounds checking, and no optimization all turned on.
Here is the f77-style code
r1 = 0.0
do k=1,nz
do j=1,ny
do i=1,nx
r1 = r1 + foo(i,j,k)
end do
end do
end do
and here is the f90 code
r = SUM(SUM(SUM(foo, DIM=3), DIM=2), DIM=1)
I have tried all sorts of variations, such as swapping the order of the loops for the f77 code, or creating temporary 2D matrices and 1D arrays to "reduce" the dimensions while using SUM, but the explicit f77 style loops always give different answers from the f90+ SUM function.
I'd appreciate any suggestions that help understand the discrepancy.
By the way this is using one serial processor.
Edited 12:13 pm to show complete example
! ifort -check bounds -extend-source 132 -g -traceback -debug inline-debug-info -mkl -o verify verify.f90
! ./verify
program verify
implicit none
integer :: nx,ny,nz
parameter(nx=131,ny=131,nz=131)
integer :: i,j,k
real :: foo(nx,ny,nz)
real :: r0,r1,r2
real :: s0,s1,s2
real :: r2Dfooxy(nx,ny),r1Dfoox(nx)
call random_seed
call random_number(foo)
r0 = 0.0
do k=1,nz
do j=1,ny
do i=1,nx
r0 = r0 + foo(i,j,k)
end do
end do
end do
r1 = 0.0
do i=1,nx
do j=1,ny
do k=1,nz
r1 = r1 + foo(i,j,k)
end do
end do
end do
r2 = 0.0
do j=1,ny
do i=1,nx
do k=1,nz
r2 = r2 + foo(i,j,k)
end do
end do
end do
!*************************
s0 = 0.0
s0 = SUM(SUM(SUM(foo, DIM=3), DIM=2), DIM=1)
s1 = 0.0
r2Dfooxy = SUM(foo, DIM = 3)
r1Dfoox = SUM(r2Dfooxy, DIM = 2)
s1 = SUM(r1Dfoox)
s2 = SUM(foo)
!*************************
print *,'nx,ny,nz = ',nx,ny,nz
print *,'size(foo) = ',size(foo)
write(*,'(A,4(ES15.8))') 'r0,r1,r2 = ',r0,r1,r2
write(*,'(A,3(ES15.8))') 'r0-r1,r0-r2,r1-r2 = ',r0-r1,r0-r2,r1-r2
write(*,'(A,4(ES15.8))') 's0,s1,s2 = ',s0,s1,s2
write(*,'(A,3(ES15.8))') 's0-s1,s0-s2,s1-s2 = ',s0-s1,s0-s2,s1-s2
write(*,'(A,3(ES15.8))') 'r0-s1,r1-s1,r2-s1 = ',r0-s1,r1-s1,r2-s1
stop
end
!**********************************************
sample output
nx,ny,nz = 131 131 131
size(foo) = 2248091
r0,r1,r2 = 1.12398225E+06 1.12399525E+06 1.12397238E+06
r0-r1,r0-r2,r1-r2 = -1.30000000E+01 9.87500000E+00 2.28750000E+01
s0,s1,s2 = 1.12397975E+06 1.12397975E+06 1.12398225E+06
s0-s1,s0-s2,s1-s2 = 0.00000000E+00-2.50000000E+00-2.50000000E+00
r0-s1,r1-s1,r2-s1 = 2.50000000E+00 1.55000000E+01-7.37500000E+00
First, welcome to StackOverflow. Please take the tour! There is a reason we expect a Minimal, Complete, and Verifiable example: without one we can only look at your code and guess at what might be the case, which is not very helpful for the community.
I hope the following suggestions help you figure out what is going on.
Use the size() function and print what Fortran thinks the sizes of the dimensions are, as well as printing nx, ny, and nz. For all we know, the array is declared bigger than nx, ny, and nz, and those variables are set according to the data set. Fortran does not necessarily initialize arrays to zero, depending on whether the array is static or allocatable.
You can also try specifying array extents in the sum function:
r = Sum(foo(1:nx,1:ny,1:nz))
If done like this, at least we know that the sum function is working on the exact same slice of foo that the loops loop over.
If this is the case, you will get the wrong answer even though there is nothing 'wrong' with the code. This is why it is particularly important to give that Minimal, Complete, and Verifiable example.
I can see the differences now. These are typical rounding errors from adding small numbers to a large sum. The processor is allowed to use any order of the summation it wants. There is no "right" order. You cannot really say that the original loops make the "correct" answer and the others do not.
What you can do is to use double precision. In extreme circumstances there are tricks like the Kahan summation but one rarely needs that.
Addition of a small number to a large sum is imprecise and especially so in single precision. You still have four significant digits in your result.
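As a minimal sketch of both suggestions, applied to an array shaped like the one in the question (the names are illustrative; note that value-unsafe floating-point optimizations can reassociate the Kahan update and defeat the compensation, so compile with something like ifort's -fp-model strict, mentioned further down, when testing it):
program sum_precision_demo
  implicit none
  integer, parameter :: nx = 131, ny = 131, nz = 131
  real :: foo(nx,ny,nz)
  double precision :: rdp
  real :: rk, c, y, t
  integer :: i, j, k

  call random_seed
  call random_number(foo)

  ! 1) accumulate in double precision: usually all that is needed
  rdp = 0.d0
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        rdp = rdp + foo(i,j,k)
      end do
    end do
  end do

  ! 2) Kahan (compensated) summation in single precision
  rk = 0.0
  c  = 0.0
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        y = foo(i,j,k) - c
        t = rk + y
        c = (t - rk) - y   ! the part of y that was lost when forming t
        rk = t
      end do
    end do
  end do

  print *, 'plain SUM              :', sum(foo)
  print *, 'double precision loop  :', rdp
  print *, 'Kahan single precision :', rk
end program sum_precision_demo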
One typically does not use the DIM= argument; it is meant for certain special circumstances.
If you want to sum all elements of foo, use just
s0 = SUM(foo)
That is enough.
What
s0 = SUM(SUM(SUM(foo, DIM=3), DIM=2), DIM=1)
does is first make a temporary 2D array whose elements are the sums along the z dimension, then a 1D array whose elements are the sums over the last dimension of that 2D array, and finally the sum of that 1D array. If it is done well the final result will be the same, but it will eat a lot of CPU cycles.
The sum intrinsic function returns a processor-dependent approximation to the sum of the elements of its array argument. This is not the same thing as sequentially adding all the elements.
It is simple to find an array x where
summation = x(1) + x(2) + x(3)
(performed strictly left to right) is not the best approximation for the sum treating the values as "mathematical reals" rather than floating point numbers.
As a concrete example of the nature of the approximation with ifort, we can look at the following program. We need not enable optimizations here to see the effect: the importance of the order of summation is apparent even with optimizations disabled (with -O0 or -debug).
implicit none
integer i
real x(50)
real total
x = [1.,(EPSILON(0.)/2, i=1, SIZE(x)-1)]
total = 0
do i=1, SIZE(x)
total = total+x(i)
print '(4F17.14)', total, SUM(x(:i)), SUM(DBLE(x(:i))), REAL(SUM(DBLE(x(:i))))
end do
end program
If we add up in strict left-to-right order we get 1., seeing that adding anything smaller in magnitude than epsilon(0.) to 1. doesn't change the sum.
You can experiment with the size of the array and order of its elements, the scaling of the small numbers and the ifort floating point compilation options (such as -fp-model strict, -mieee-fp, -pc32). You can also try to find an example like the above using double precision instead of default real.

Using Persistent Communication in Fortran MPI

I am using persistent communication in my CFD code. The communication setup is in another subroutine, and in the main subroutine, where I have the do loop, I use MPI_STARTALL() and MPI_WAITALL().
To keep this short, I am showing only the first part of the setup; the rest of the arrays are handled in exactly the same way.
My setup subroutine looks like:
Subroutine MPI_Subroutine
use Variables
use mpi
implicit none
!Starting up MPI
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,npes,ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,MyRank,ierr)
!Compute the size of local block (1D Decomposition)
Jmax = JmaxGlobal
Imax = ImaxGlobal/npes
if (MyRank.lt.(ImaxGlobal - npes*Imax)) then
Imax = Imax + 1
end if
if (MyRank.ne.0.and.MyRank.ne.(npes-1)) then
Imax = Imax + 2
Else
Imax = Imax + 1
endif
! Computing neighboars
if (MyRank.eq.0) then
Left = MPI_PROC_NULL
else
Left = MyRank - 1
end if
if (MyRank.eq.(npes -1)) then
Right = MPI_PROC_NULL
else
Right = MyRank + 1
end if
! Initializing the Arrays in each processor, according to the number of local nodes
Call InitializeArrays
!Creating the channel of communication for this computation,
!Sending and receiving the u_old (Ghost cells)
Call MPI_SEND_INIT(u_old(2,:),Jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(1),ierr)
Call MPI_RECV_INIT(u_old(Imax,:),jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(2),ierr)
Call MPI_SEND_INIT(u_old(Imax-1,:),Jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(3),ierr)
Call MPI_RECV_INIT(u_old(1,:),jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(4),ierr)
Since I am debugging the code, I am just checking these arrays. When I check, my ghost cells are full of zeroes, so I guess I am getting the calls wrong somewhere.
The main code, where I call MPI_STARTALL and MPI_WAITALL, looks like:
Program
use Variables
use mpi
implicit none
open(32, file = 'error.dat')
Call MPI_Subroutine
!kk=kk+1
DO kk=1, 2001
! A lot of calculation
! communicating the maximum error among the processes and delta t
call MPI_REDUCE(eps,epsGlobal,1,MPI_DOUBLE_PRECISION,MPI_MAX,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(epsGlobal,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_DOUBLE_PRECISION,MPI_MIN,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) delta_t = delta_tGlobal
call MPI_BCAST(delta_t,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) then
write(*,*) kk,epsGlobal,(kk*delta_t)
write(32,*) kk,epsGlobal
endif
Call Swap
Call MPI_STARTALL(4,req,ierr) !
Call MPI_WAITALL(4,req,status,ierr)
enddo
The variables are set in another module. The MPI-related variables look like:
! MPI variables
INTEGER :: npes, MyRank, ierr, Left, Right, tag
INTEGER :: status(MPI_STATUS_SIZE,4)
INTEGER,dimension(4) :: req
I appreciate your time and any suggestions on this problem.
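For reference, a minimal, self-contained sketch of the persistent-communication pattern in use here (not the code above; buffer names and sizes are invented): requests are created once with MPI_SEND_INIT/MPI_RECV_INIT, started and completed every iteration with MPI_STARTALL/MPI_WAITALL, and freed at the end. One general point: the buffers handed to the *_INIT calls must be the same contiguous memory every time the requests are started.
program persistent_demo
  use mpi
  implicit none
  integer, parameter :: n = 100
  integer :: rank, npes, left, right, ierr, step
  integer :: req(4), stats(MPI_STATUS_SIZE,4)
  double precision :: sendL(n), sendR(n), recvL(n), recvR(n)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
  left  = merge(MPI_PROC_NULL, rank-1, rank == 0)
  right = merge(MPI_PROC_NULL, rank+1, rank == npes-1)

  ! requests are created once and bind these exact buffers
  call MPI_SEND_INIT(sendL, n, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(1), ierr)
  call MPI_SEND_INIT(sendR, n, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, req(2), ierr)
  call MPI_RECV_INIT(recvR, n, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, req(3), ierr)
  call MPI_RECV_INIT(recvL, n, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, req(4), ierr)

  do step = 1, 10
    sendL = dble(rank)                 ! refill the send buffers for this step
    sendR = dble(rank)
    call MPI_STARTALL(4, req, ierr)    ! start all four persistent requests
    call MPI_WAITALL(4, req, stats, ierr)
    ! recvL / recvR now hold the neighbours' data for this step
  end do

  call MPI_REQUEST_FREE(req(1), ierr)  ! persistent requests are freed explicitly
  call MPI_REQUEST_FREE(req(2), ierr)
  call MPI_REQUEST_FREE(req(3), ierr)
  call MPI_REQUEST_FREE(req(4), ierr)
  call MPI_FINALIZE(ierr)
end program persistent_demo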

Why doesn't this fortran code work?

Hey, I wrote this (Fortran) with the aim of finding the minimum spanning tree of a bunch of points (syscount of them). I know for a fact that this approach works, since I wrote it in JavaScript earlier today. JS is slow though, and I wanted to see how much faster Fortran would be!!
The only problem is that it's not working; I'm getting an annoying error:
prims.f95:72.43:
if((check == 1) .and. (path(nodesin(j))(k) < minpath)) then
1
Error: Expected a right parenthesis in expression at (1)
What the hell is that about?! The 43rd character on the line is the "h" of "path".
nodesin(1) = 1
do i = 1,syscount-1
pathstart = -1
pathend = -1
minpath = 2000
do j = 1,i
do k = 1, syscount
check = 1
do l = 1, i
if(nodesin(l) == k) then
check = 0
end if
end do
if((check == 1) .and. (path(nodesin(j))(k) < minpath)) then
minpath = path(nodesin(j))(k)
pathstart = nodesin(j)
pathend = k
end if
end do
end do
nodesin(i+1) = pathend
minpaths(i)(1) = pathstart
minpaths(i)(2) = pathend
end do
Also, I'm fairly new to Fortran, so I have a few other questions:
Can I use && instead of .and.?
Is there a version of the for (object in list) {} loop found in many other languages?
Is there a version of the PHP function in_array, i.e. bool in_array(needle, haystack)? And if there is, is there a better way of doing it than:
check = false
Asize = size(array)
do i = 1, Asize
if(array(i) == needle) then
check = true
end if
end do
and then using the check variable to see if it's there?
(I haven't posted anything on StackOverflow before. Please don't get angry if I've broken loads of etiquette rules!)
It looks like you have defined path and minpaths as two-dimensional arrays. Multi-dimensional arrays are accessed differently in Fortran when compared to C-like languages. In Fortran you separate the indices by commas within one set of parentheses.
I'm guessing by the use of these variables they are integer arrays. Here is how you access elements of those arrays (since you didn't share your variable declarations I am making up the shape of these arrays):
integer :: path(n1, n2)
integer :: minpaths(n3, 2)
your if statement should be:
if((check == 1) .and. (path(nodesin(j), k) < minpath)) then
your access to minpaths should be:
minpaths(i, 1) = pathstart
minpaths(i, 2) = pathend
Also, if you are not using IMPLICIT NONE I recommend you consider it. Not using it is dangerous, and you are using variable names that are close to each other (minpath and minpaths). You could save hours of hair pulling debugging by using IMPLICIT NONE.
While .EQ. can be replaced with ==, there is still only .AND.
For your code block to check whether a "variable is there", you can use "where" and have much shorter code!
In Fortran >= 90 statements and functions can operate on arrays so that explicit loops don't have to be used as frequently.
There is no for (object in list), but using the where statement can do something very similar.
Many of the intrinsic functions that act on arrays also take masks as optional arguments to selectively operate.
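For instance, a short sketch of the array-intrinsic style (ANY covers the in_array case directly; WHERE and COUNT act element-wise on the same kind of mask):
program in_array_demo
  implicit none
  integer :: array(10), hits(10)
  integer :: needle
  logical :: check

  array  = (/ 3, 1, 4, 1, 5, 9, 2, 6, 5, 3 /)
  needle = 5

  check = any(array == needle)        ! does the job of PHP's in_array(needle, haystack)
  hits  = 0
  where (array == needle) hits = 1    ! WHERE operates only on the matching elements

  print *, check, count(array == needle)
  print *, hits
end program in_array_demo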
I suggest reading a book to learn about these features. I like the one by Metcalf, Reid and Cohen. In the meantime, the second Wikipedia article may help: http://en.wikipedia.org/wiki/Fortran_95_language_features