I'm currently writing a small-scale code that writes its output with CGNS and does adaptive mesh refinement (AMR) with Amrex. This is all being done in Fortran 95, though CGNS is C with Fortran interfaces and Amrex is C++ with Fortran interfaces (those are not in the sample code). I'm using OpenMPI 1.10.7.
This will eventually go into a full CFD code, but I wanted to test it at small scale to work the bugs out before putting it in the larger code. The program below seems to work every time, but it was originally a subroutine that did not.
I'm having an issue where not all of the data from MPI_Bcast is being received by every process... sometimes. I can run the same code twice in a row, and sometimes it bombs out (segfault from CGNS elsewhere in the code) and sometimes it works. As far as I can tell, the program bombs when not all of the data from MPI_Bcast is received in time to start work elsewhere. Despite MPI_wait and MPI_barrier, the writes at the bottom of the subroutine will spit out junk on lvl=1 for the last six indices of all the arrays. Printing information to the screen seems to help, but more processors seem to lower the likelihood of the code working.
I've currently got it as MPI_ibcast with an MPI_wait, but I've also tried MPI_Bcast with MPI_barriers after. Changing the communicator from one defined by Amrex to MPI_COMM_WORLD doesn't help.
...
program so_bcast
!
!
!
!
use mpi
implicit none
integer :: lvl,i,a,b,c,ier,state(MPI_STATUS_SIZE),d
integer :: n_elems,req,counter,tag,flavor
integer :: stat(MPI_STATUS_SIZE)
integer :: self,nprocs
type :: box_zones
integer,allocatable :: lower(:,:),higher(:,:),little_zones(:)
double precision,allocatable :: lo_corner(:,:),hi_corner(:,:)
integer :: big_zones
integer,allocatable :: zone_start(:),zone_end(:)
end type
type(box_zones),allocatable :: zone_storage(:)
call MPI_INIT(ier)
call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ier)
call MPI_COMM_RANK(MPI_COMM_WORLD,self,ier)
lvl = 1
! Allocate everything, this is done elsewhere in the actual code, but done here
! for simplification reasons
allocate(zone_storage(0:lvl))
zone_storage(0)%big_zones = 4
zone_storage(1)%big_zones = 20
do i = 0,lvl
allocate(zone_storage(i)%lower(3,zone_storage(i)%big_zones))
allocate(zone_storage(i)%higher(3,zone_storage(i)%big_zones))
allocate(zone_storage(i)%lo_corner(3,zone_storage(i)%big_zones))
allocate(zone_storage(i)%hi_corner(3,zone_storage(i)%big_zones))
zone_storage(i)%lower = self
zone_storage(i)%higher = self*2+1
zone_storage(i)%lo_corner = self*1.0D0
zone_storage(i)%hi_corner = self*1.0D0+1.0D0
allocate(zone_storage(i)%zone_start(0:nprocs-1))
allocate(zone_storage(i)%zone_end(0:nprocs-1))
zone_storage(i)%zone_start(self) = zone_storage(i)%big_zones/nprocs*self+1
zone_storage(i)%zone_end(self) = zone_storage(i)%zone_start(self)+zone_storage(i)%big_zones/nprocs-1
if (zone_storage(i)%zone_end(self)>zone_storage(i)%big_zones) zone_storage(i)%zone_end(self) = zone_storage(i)%big_zones
end do
do i = 0,lvl
write(*,*) 'lower check 0',self,'lower',zone_storage(i)%lower
write(*,*) 'higher check 0',self,'high',zone_storage(i)%higher
write(*,*) 'lo_corner check 0',self,'lo_corner',zone_storage(i)%lo_corner
write(*,*) 'hi_corner check 0',self,'hi_corner',zone_storage(i)%hi_corner
write(*,*) 'big_zones check 0',self,'big_zones',zone_storage(i)%big_zones
write(*,*) 'zone start/end 0',self,'lvl',i,zone_storage(i)%zone_start,zone_storage(i)%zone_end
end do
!
! Agglomerate the appropriate data to processor 0 using non-blocking receives
! and blocking sends
!
do i = 0,lvl
do a = 0,nprocs-1
call mpi_bcast(zone_storage(i)%zone_start(a),1,&
MPI_INT,a,MPI_COMM_WORLD,ier)
call mpi_bcast(zone_storage(i)%zone_end(a),1,&
MPI_INT,a,MPI_COMM_WORLD,ier)
end do
end do
call MPI_BARRIER(MPI_COMM_WORLD,ier)
counter = 0
do i = 0,lvl
n_elems = 3*zone_storage(i)%big_zones
write(*,*) 'number of elements',n_elems
if (self == 0) then
do a = 1,nprocs-1
do c = zone_storage(i)%zone_start(a),zone_storage(i)%zone_end(a)
tag = c*100000+a*1000+1!+d*10
call mpi_irecv(zone_storage(i)%lower(1:3,c),3,MPI_INT,a,&
tag,MPI_COMM_WORLD,req,ier)
tag = tag + 1
call mpi_irecv(zone_storage(i)%higher(1:3,c),3,MPI_INT,a,&
tag,MPI_COMM_WORLD,req,ier)
tag = tag +1
call mpi_irecv(zone_storage(i)%lo_corner(1:3,c),3,MPI_DOUBLE_PRECISION,a,&
tag,MPI_COMM_WORLD,req,ier)
tag = tag +1
call mpi_irecv(zone_storage(i)%hi_corner(1:3,c),3,MPI_DOUBLE_PRECISION,a,&
tag,MPI_COMM_WORLD,req,ier)
end do
end do
else
do b = zone_storage(i)%zone_start(self),zone_storage(i)%zone_end(self)
tag = b*100000+self*1000+1!+d*10
call mpi_send(zone_storage(i)%lower(1:3,b),3,MPI_INT,0,&
tag,MPI_COMM_WORLD,ier)
tag = tag + 1
call mpi_send(zone_storage(i)%higher(1:3,b),3,MPI_INT,0,&
tag,MPI_COMM_WORLD,ier)
tag = tag + 1
call mpi_send(zone_storage(i)%lo_corner(1:3,b),3,MPI_DOUBLE_PRECISION,0,&
tag,MPI_COMM_WORLD,ier)
tag = tag +1
call mpi_send(zone_storage(i)%hi_corner(1:3,b),3,MPI_DOUBLE_PRECISION,0,&
tag,MPI_COMM_WORLD,ier)
end do
end if
end do
write(*,*) 'spack'
!
call mpi_barrier(MPI_COMM_WORLD,ier)
do i = 0,lvl
write(*,*) 'lower check 1',self,'lower',zone_storage(i)%lower
write(*,*) 'higher check 1',self,'high',zone_storage(i)%higher
write(*,*) 'lo_corner check 1',self,'lo_corner',zone_storage(i)%lo_corner
write(*,*) 'hi_corner check 1',self,'hi_corner',zone_storage(i)%hi_corner
write(*,*) 'big_zones check 1',self,'big_zones',zone_storage(i)%big_zones
write(*,*) 'zone start/end 1',self,'lvl',i,zone_storage(i)%zone_start,zone_storage(i)%zone_end
end do
!
! Send all the data out to all the processors
!
do i = 0,lvl
n_elems = 3*zone_storage(i)%big_zones
req = 1
call mpi_ibcast(zone_storage(i)%lower,n_elems,MPI_INT,&
0,MPI_COMM_WORLD,req,ier)
call mpi_wait(req,stat,ier)
write(*,*) 'spiffy'
req = 2
call mpi_ibcast(zone_storage(i)%higher,n_elems,MPI_INT,&
0,MPI_COMM_WORLD,req,ier)
call mpi_wait(req,stat,ier)
req = 3
call mpi_ibcast(zone_storage(i)%lo_corner,n_elems,MPI_DOUBLE_PRECISION,&
0,MPI_COMM_WORLD,req,ier)
call mpi_wait(req,stat,ier)
req = 4
call mpi_ibcast(zone_storage(i)%hi_corner,n_elems,MPI_DOUBLE_PRECISION,&
0,MPI_COMM_WORLD,req,ier)
call mpi_wait(req,stat,ier)
call mpi_barrier(MPI_COMM_WORLD,ier)
end do
write(*,*) 'lower check 2',self,'lower',zone_storage(lvl)%lower
write(*,*) 'higher check 2',self,'high',zone_storage(lvl)%higher
write(*,*) 'lo_corner check ',self,'lo_corner',zone_storage(lvl)%lo_corner
write(*,*) 'hi_corner check ',self,'hi_corner',zone_storage(lvl)%hi_corner
write(*,*) 'big_zones check ',self,'big_zones',zone_storage(lvl)%big_zones
call MPI_FINALIZE(ier)
end program
...
As I said, this code works, but the larger version does not always work. OpenMPI throws several warnings akin to this:
mca: base: component_find: ess "mca_ess_pmi" uses an MCA interface that is not recognized (component MCA v2.1.0 != supported MCA v2.0.0) -- ignored
mca: base: component_find: grpcomm "mca_grpcomm_direct" uses an MCA interface that is not recognized (component MCA v2.1.0 != supported MCA v2.0.0) -- ignored
mca: base: component_find: rcache "mca_rcache_grdma" uses an MCA interface that is not recognized (component MCA v2.1.0 != supported MCA v2.0.0) -- ignored
etc. etc. But the program can still complete even with those warnings.
-Is there a way to ensure that MPI_bcast has emptied its buffer into the correct region of memory before moving on? It seems to miss this sometimes.
-Is there a different/better method to distribute the data? The sizes have to be able to vary unlike the test program.
Thank you ahead of time.
The most straightforward answer was to use MPI_allgatherv. As little as I wanted to mess with displacements, it was the best setup to share the information and reduce overall code length.
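I believe an MPI_waitall solution would work too, as the data was not being fully received before being broadcast: the mpi_irecv requests on process 0 were never waited on (each call overwrote req), so the broadcast could start before the receives had completed.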
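As a rough illustration of the MPI_allgatherv approach mentioned above, here is a minimal sketch. It is not the code from the actual program: the names counts, displs, mine and lower, the fixed big_zones = 20, and the way zones are split across ranks are all assumptions made for the example. Each rank fills only its own zones, and one collective call leaves every rank with the full array.

```fortran
program allgatherv_sketch
  ! Sketch only: counts, displs, mine and lower are illustrative names,
  ! and the zone split below is an assumption, not the original scheme.
  use mpi
  implicit none
  integer :: ier, self, nprocs, r, big_zones, base, extra, nlocal
  integer, allocatable :: counts(:), displs(:), mine(:,:), lower(:,:)

  call MPI_Init(ier)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ier)
  call MPI_Comm_rank(MPI_COMM_WORLD, self, ier)

  big_zones = 20
  allocate(counts(0:nprocs-1), displs(0:nprocs-1), lower(3, big_zones))

  ! Split the zones as evenly as possible; the first "extra" ranks get one more.
  base  = big_zones / nprocs
  extra = mod(big_zones, nprocs)
  displs(0) = 0
  do r = 0, nprocs-1
     counts(r) = 3 * (base + merge(1, 0, r < extra))   ! 3 integers per zone
     if (r > 0) displs(r) = displs(r-1) + counts(r-1)  ! offset in integers
  end do

  ! Local contribution: this rank's zones only (may be empty).
  nlocal = counts(self) / 3
  allocate(mine(3, nlocal))
  mine = self

  ! Blocking collective: when it returns, lower is complete on every rank.
  call MPI_Allgatherv(mine, counts(self), MPI_INTEGER, &
                      lower, counts, displs, MPI_INTEGER, &
                      MPI_COMM_WORLD, ier)

  if (self == 0) write(*,*) 'owner of each zone:', lower(1, :)
  call MPI_Finalize(ier)
end program allgatherv_sketch
```

Because counts and displs are computed at run time, the sizes can vary from level to level. The sketch uses MPI_INTEGER, which is the Fortran datatype handle; MPI_INT is the handle for a C int.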
I'm learning how to use Fortran to do some data analysis. I'm working through the following example:
program linalg
implicit none
real :: v1(3), v2(3), m(3,3)
integer :: i,j
v1(1) = 0.25
v1(2) = 1.2
v1(3) = 0.2
! use nested do loops to initialise the matrix
! to the unit matrix
do i=1,3
do j=1,3
m(i,j) = 0.0
end do
m(i,j) = 1.0
end do
! do a matrix multiplication of a vector, equivalent to v2i = mij v1j
do i = 1,3
v2(i) = 0.0
do j = 1,3
v2(i) = v2(i) + m(i,j)*v1(j)
end do
end do
write(*,*) 'v2 = ', v2
end program linalg
which I execute with
f95 -o linalg linalg.f90
./linalg
However, I get the following message in the terminal:
Bus error
Some links that I've followed online suggest that this is to do with not having pre-defined a variable, but I am sure that I have in this script and cannot find where the error is coming from. Is there another reason I would be getting this error?
Your mistake is in here
do i=1,3
do j=1,3
m(i,j) = 0.0
end do
m(i,j) = 1.0 ! here be a dragon
end do
Fortran is explicit in stating that after the end of a loop the value of the index variable is 1 greater than the value it had on the last iteration of the loop. So in this case the statement m(i,j) = 1.0 will try to address m(1,4) at the first go round, then m(2,4), and so forth.
Sometimes you get 'lucky' with an attempt to write outside the bounds of an array and the write stays inside the address space of the process you are working in. 'Lucky' in the sense that your program is wrong but doesn't crash -- this crash is a much better situation to be in. The bus error suggests that the compiler has generated an address to write to that lies in forbidden territory for any process.
You could have found this yourself by turning on 'run-time bounds checking' with your compiler. Your compiler's documentation, or other Qs and As here on SO, will tell you how to do that.
I'll leave it to you to fix this as you wish, you show every sign of being able to figure it out now you know the rules.
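For readers who want to see the loop-index rule in action, here is a small sketch (the program name linalg_check is made up, and this is only one of several possible fixes). It prints the index after a completed loop and then builds the unit matrix without ever indexing outside 1..3. If f95 is gfortran on your system, compiling the original program with -fcheck=bounds should report the out-of-bounds write at run time.

```fortran
program linalg_check
  implicit none
  real :: v1(3), v2(3), m(3,3)
  integer :: i, j

  v1 = (/ 0.25, 1.2, 0.2 /)

  ! After a completed DO loop the index is one step past the final value.
  do j = 1, 3
  end do
  write(*,*) 'j after the loop = ', j     ! j is 4 here, outside 1..3

  ! Unit matrix: zero everything, then set the diagonal with m(i,i).
  m = 0.0
  do i = 1, 3
     m(i,i) = 1.0
  end do

  ! Same product as the nested loops: v2(i) = sum over j of m(i,j)*v1(j)
  v2 = matmul(m, v1)
  write(*,*) 'v2 = ', v2
end program linalg_check
```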
I have some issues about opening and reading multiple files. I have to write a code which reads two columns in n files formatted in the same way (they are different only for the values...). Before this, I open another input file and an output file in which I will write my results. I read other questions in this forum (such as this one) and tried to do the same thing, but I receive these errors:
read(fileinp,'(I5)') i-49
1
devstan.f90:20.24:
fileLoop : do i = 50,52
2
Error: Variable 'i' at (1) cannot be redefined inside loop beginning at (2)
and
read(fileinp,'(I5)') i-49
1
Error: Invalid character in name at (1)
My files are numbered from 1 to n and are named 'lin*27-n.dat' (where n is the index, starting from 1), and the code is:
program deviation
implicit none
character(len=15) :: filein,fileout,fileinp
integer :: row,i,h
real :: usv,usf,tsv,tsf,diff
write(*,'(2x,''Input file .......''/)')
read(*,'(a12)') filein
write(*,'(2x,''Output file........''/)')
read(*,'(a12)') fileout
open(unit = 30,File=filein)
open(unit = 20,File=fileout)
fileLoop : do i = 50,52
fileinp = 'lin*27-'
read(fileinp,'(I5)') i-49
open(unit = i,File=fileinp)
do row = 1,24
read(30,*) h,usv,tsv
read(i,*) h,usf,tsf
diff = usf - usv
write(20,*) diff
enddo
close(i)
enddo fileLoop
end program deviation
How can I solve it? I am not a pro in Fortran, so please don't use difficult language, thanks.
The troublesome line is
read(fileinp,'(I5)') i-49
You surely mean to do a write (as in the example linked): this read statement attempts to read from the variable fileinp rather than writing to it.
That said, simply replacing with write is probably not what you need either. This will ignore the previous line
fileinp = 'lin*27-'
merely setting fileinp to, in turn, "1", "2", "3" (with leading blanks). Instead, use something like the following (assuming you intend that * to be there):
write(fileinp, '("lin*27-",I1)') i-49
Note also the use of I1 in the format, rather than I5: one may want to avoid blanks in the filename. [This is suitable when there is exactly one digit; look up Iw.m and I0 when generalizing.]
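As a small sketch of the I0 generalization mentioned above: the internal write below builds each file name in one step. The ".dat" suffix is taken from the file names described in the question, and the program name build_names is made up for the example.

```fortran
program build_names
  implicit none
  character(len=15) :: fileinp
  integer :: i

  do i = 50, 52
     ! Internal write: format the number i-49 into the character variable.
     ! I0 uses exactly as many digits as needed, so no blanks appear.
     write(fileinp, '("lin*27-",I0,".dat")') i-49
     write(*,'(3A)') '[', trim(fileinp), ']'
  end do
end program build_names
```

This prints [lin*27-1.dat], [lin*27-2.dat] and [lin*27-3.dat]. For file numbers with more digits, fileinp must be declared long enough to hold the whole name.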
I am completely new to Fortran and pretty new to programming in general. I am trying to compile a script someone else has written. This is giving me a few problems. The top half of the code is:
C
C     Open direct-access output file ('JPLEPH')
C
      OPEN ( UNIT = 12,
     .       FILE = 'JPLEPH',
     .       ACCESS = 'DIRECT',
     .       FORM = 'UNFORMATTED',
     .       RECL = IRECSZ,
     .       STATUS = 'NEW' )
C
C     Read and write the ephemeris data records (GROUP 1070).
C
      CALL NXTGRP ( HEADER )
      IF ( HEADER .NE. 'GROUP 1070' ) CALL ERRPRT(1070,'NOT HEADER')
      NROUT = 0
      IN = 0
      OUT = 0
    1 READ(*,'(2i6)')NRW,NCOEFF
      if(NRW .EQ. 0) GO TO 1
      READ (*,'(3D26.18)',IOSTAT =IN) (DB(K),K=1,NCOEFF)
      DO WHILE ( ( IN .EQ. 0 )
     .     .AND. ( DB(2) .LT. T2) )
         IF ( 2*NCOEFF .NE. KSIZE ) THEN
            CALL ERRPRT(NCOEFF,' 2*NCOEFF not equal to KSIZE')
         ENDIF
C
C        Skip this data block if the end of the interval is less
C        than the specified start time or if the it does not begin
C        where the previous block ended.
C
         IF ( (DB(2) .GE. T1) .AND. (DB(1) .GE. DB2Z) ) THEN
            IF ( FIRST ) THEN
C
C              Don't worry about the intervals overlapping
C              or abutting if this is the first applicable
C              interval.
C
               DB2Z = DB(1)
               FIRST = .FALSE.
            ENDIF
            IF (DB(1) .NE. DB2Z ) THEN
C
C              Beginning of current interval is past the end
C              of the previous one.
               CALL ERRPRT (NRW, 'Records do not overlap or abut')
            ENDIF
            DB2Z = DB(2)
            NROUT = NROUT + 1
            print*,'Out =', OUT
            WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K),K=1,NCOEFF)
            print*,'Out2 =', OUT
            IF ( OUT .NE. 0 ) THEN
               CALL ERRPRT (NROUT,
     .                  'th record not written because of error')
            ENDIF
So, when I print "Out" and "Out2" to the screen I find that Out=0 and Out2=110. As it is no longer equal to zero, the program gives me an error. Therefore I am basically wondering about what is happening here:
WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K),K=1,NCOEFF)
I assume that 12 refers to the file I have opened (and created), and want to write to. What does the rest of the first set of brackets do? And what is the point of the second? Does that give me the information I want to put in my file? If so, where does DB get filled?
Generally I am wondering what is going wrong? Why does OUT change value? (I need t
NCOEFF is defined as an Integer in the beginning of the programme, and DB: as DOUBLE PRECISION DB(3000), DB2Z/0.d0/ , so I assume DB is an array of some sort.
In this program, OUT tells you whether the write statement was successful or not (the IOSTAT parameter to the write statement means "I/O status", or input/output status). It is set to 0 if the I/O operation was a success, or to the number of the error code otherwise. You can find what the error codes mean here.
I'm not familiar with the REC parameter, but a starting place to investigate yourself can be found here.
To quote the handbook, REC indicates the record number to be read or written. As advised, see the documentation which accompanies your compiler for further explanation.
(DB(K),K=1,NCOEFF) means 'all the elements in DB from 1 to NCOEFF'. You are looking at an io-implied-do.
The statement
WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K),K=1,NCOEFF)
is, if memory serves me, called an "implied DO-loop". As written, it will write NCOEFF values from array DB, starting at DB(1).
It is called an implied DO-loop because the explicit form would be (in FORTRAN IV, for the ancients: I know it a lot better than the more modern variations) something along the lines of:
DO 10 K=1,NCOEFF
WRITE (12,REC=NROUT+2,IOSTAT=OUT) DB(K)
10 CONTINUE
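(Pretend that the first two lines are indented six columns.) This is a DO-loop. The implied DO-loop form lets you put the "loop" right in the input/output statement. One difference to keep in mind: each WRITE statement transfers one record, so the explicit loop above would write NCOEFF separate one-value records (here all to the same REC), whereas the implied DO-loop writes all NCOEFF values into a single record.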
What makes it useful is that you can have multiple arrays, and multiple loops. For a simple example:
WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K), DC(K), K=1,NCOEFF)
110 is the error code thrown by the WRITE call. You need to check your FORTRAN RTL (run-time library) reference. It should list the possible error codes. I think 110 means that you're trying to convert a double-precision value to an integer, but the value is bigger than you can store in an integer. Maybe dump the values in DB and see.
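The pieces discussed above (a direct-access file, REC, IOSTAT and an implied DO-loop) can be tried in isolation with a sketch like the following. It is written in free-form Fortran rather than the fixed form of the original. The file name sketch.bin, the record number 3 and the array size are made up for the example, and it is not a diagnosis of the 110 error. It uses INQUIRE(IOLENGTH=...) so that RECL is large enough for the output list in whatever units the compiler uses.

```fortran
program implied_do_sketch
  implicit none
  integer, parameter :: ncoeff = 4
  double precision :: db(ncoeff), back(ncoeff)
  integer :: k, ios, recsz

  ! Fill db with 1.0, 2.0, ... using an implied DO in an array constructor.
  db = (/ (dble(k), k = 1, ncoeff) /)

  ! Ask for the record length needed by this exact output list.
  inquire(iolength=recsz) (db(k), k = 1, ncoeff)

  open(unit=12, file='sketch.bin', access='direct', form='unformatted', &
       recl=recsz, status='replace', iostat=ios)
  if (ios /= 0) stop 'open failed'

  ! One WRITE, one record: all ncoeff values go into record 3.
  write(12, rec=3, iostat=ios) (db(k), k = 1, ncoeff)
  write(*,*) 'write iostat =', ios

  ! Read the same record back with the same implied DO.
  read(12, rec=3, iostat=ios) (back(k), k = 1, ncoeff)
  write(*,*) 'read iostat  =', ios, ' values =', back

  close(12, status='delete')
end program implied_do_sketch
```

If RECL were smaller than the output list needs, the WRITE would fail and IOSTAT would come back non-zero, which is one thing worth checking when a direct-access write reports an error.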