MPI master unable to receive - Fortran

I am using MPI in Fortran for computation on my data. I verified by printing the data that the computations are being performed on the desired range by each process just fine, but the master is unable to collate the data.
Here is the code that I am trying to make work:
EDIT: Created a tag which is constant for the send and recv
integer :: tag
tag = 123
if(pid.ne.0) then
   print *,'pid: ',pid,'sending'
   DO j = start_index+1, end_index
      CALL MPI_SEND(datapacket(j),1, MPI_REAL,0, tag, MPI_COMM_WORLD)
      !print *,'sending'
   END DO
   print *,'send complete'
else
   DO slave_id = 1, npe-1
      rec_start_index = slave_id*population_size+1
      rec_end_index = (slave_id + 1) * population_size;
      IF (slave_id == npe-1) THEN
         rec_end_index = total-1;
      ENDIF
      print *,'received 1',rec_start_index,rec_end_index
      CALL MPI_RECV(datapacket(j),1,MPI_REAL,slave_id,tag,MPI_COMM_WORLD, &
         & status)
      !print *,'received 2',rec_start_index,rec_end_index
   END DO
It never prints 'received' or anything after the MPI_RECV call, but I can see the sending happening just fine. However, there is no way I can verify that except to rely on the print statements.
The variable datapacket is initialized as follows:
real, dimension (:), allocatable :: datapacket
Is there anything that I am doing wrong here?
EDIT: For the test setup, all the processes are being run on localhost.

You are using different message tags for all the sends; however, in your receive you use just j, which is never altered on the root process. Also note that your implementation looks like an MPI_Gather, which I'd recommend using instead of implementing this yourself.
EDIT: Sorry, after your update I now realize that you are in fact sending multiple messages from each rank > 0 (start_index+1 up to end_index). If you need that, you do need tags differentiating the individual messages, but you then also need multiple receives on your master.
Maybe it would be better to state what you actually want to achieve.
Do you want something like this:
integer :: tag, ierr
integer :: status(MPI_STATUS_SIZE)
tag = 123
if(pid.ne.0) then
   print *,'pid: ',pid,'sending'
   CALL MPI_SEND(datapacket(start_index+1:end_index),end_index-start_index, &
      MPI_REAL,0, tag, MPI_COMM_WORLD, ierr)
   !print *,'sending'
   print *,'send complete'
else
   DO slave_id = 1, npe-1
      rec_start_index = slave_id*population_size+1
      rec_end_index = (slave_id + 1) * population_size
      IF (slave_id == npe-1) THEN
         rec_end_index = total-1
      ENDIF
      print *,'received 1',rec_start_index,rec_end_index
      CALL MPI_RECV(datapacket(rec_start_index:rec_end_index), &
         rec_end_index-rec_start_index+1,MPI_REAL,slave_id,tag,MPI_COMM_WORLD, &
         status, ierr)
      !print *,'received 2',rec_start_index,rec_end_index
   END DO
end if
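For completeness, here is a minimal sketch of the MPI_Gatherv route mentioned above. The helper names (chunk_len, counts, displs, gathered) are mine, not from the original code, and I am assuming each rank's slice is defined by its own start_index/end_index as in the question:
integer :: ierr, p, chunk_len
integer, dimension(:), allocatable :: counts, displs
real, dimension(:), allocatable :: gathered

! Number of elements this rank contributes (start_index+1 .. end_index).
chunk_len = end_index - start_index
allocate(counts(0:npe-1), displs(0:npe-1), gathered(total))

! Root needs every rank's count before it can lay out the receive buffer.
call MPI_GATHER(chunk_len, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, &
                0, MPI_COMM_WORLD, ierr)
if (pid == 0) then
   displs(0) = 0
   do p = 1, npe-1
      displs(p) = displs(p-1) + counts(p-1)
   end do
end if

! One collective call replaces the whole send/receive loop; the root's
! own slice is gathered too.
call MPI_GATHERV(datapacket(start_index+1), chunk_len, MPI_REAL, &
                 gathered, counts, displs, MPI_REAL, &
                 0, MPI_COMM_WORLD, ierr)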

MPI_Bcast non-root nodes not receiving all data

So, I'm currently writing a small scale code that does output using CGNS and adaptive mesh refinement (AMR) with Amrex. This is all being done with Fortran 95, though CGNS is C with Fortran interfaces and Amrex is C++ with Fortran interfaces (those are not in the sample code). I'm using OpenMPI 1.10.7.
This will eventually go into a full CFD code, but I wanted to test it small scale to work the bugs out before putting it in the larger code. The program below seems to work every time, but it was originally a subroutine that did not.
I'm having an issue where not all of the data from MPI_Bcast is being received by every process... sometimes. I can hit execute on the same code twice in a row; sometimes it bombs out (segfault from CGNS elsewhere in the code), and sometimes it works. As far as I can tell, the program bombs when not all of the data from MPI_Bcast is received in time to start work elsewhere. Despite MPI_wait and MPI_barrier, the writes at the bottom of the subroutine will spit out junk on lvl=1 for the last six indices of all the arrays. Printing information to the screen seems to help, but more processors seem to lower the likelihood of the code working.
I've currently got it as MPI_ibcast with an MPI_wait, but I've also tried MPI_Bcast with MPI_barriers after. Changing the communicator from one defined by Amrex to MPI_COMM_WORLD doesn't help.
...
program so_bcast
!
!
!
!
   use mpi
   implicit none
   integer :: lvl,i,a,b,c,ier,state(MPI_STATUS_SIZE),d
   integer :: n_elems,req,counter,tag,flavor
   integer :: stat(MPI_STATUS_SIZE)
   integer :: self,nprocs

   type :: box_zones
      integer,allocatable :: lower(:,:),higher(:,:),little_zones(:)
      double precision,allocatable :: lo_corner(:,:),hi_corner(:,:)
      integer :: big_zones
      integer,allocatable :: zone_start(:),zone_end(:)
   end type

   type(box_zones),allocatable :: zone_storage(:)

   call MPI_INIT(ier)
   call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ier)
   call MPI_COMM_RANK(MPI_COMM_WORLD,self,ier)

   lvl = 1

   ! Allocate everything, this is done elsewhere in the actual code, but done here
   ! for simplification reasons
   allocate(zone_storage(0:lvl))
   zone_storage(0)%big_zones = 4
   zone_storage(1)%big_zones = 20

   do i = 0,lvl
      allocate(zone_storage(i)%lower(3,zone_storage(i)%big_zones))
      allocate(zone_storage(i)%higher(3,zone_storage(i)%big_zones))
      allocate(zone_storage(i)%lo_corner(3,zone_storage(i)%big_zones))
      allocate(zone_storage(i)%hi_corner(3,zone_storage(i)%big_zones))
      zone_storage(i)%lower = self
      zone_storage(i)%higher = self*2+1
      zone_storage(i)%lo_corner = self*1.0D0
      zone_storage(i)%hi_corner = self*1.0D0+1.0D0
      allocate(zone_storage(i)%zone_start(0:nprocs-1))
      allocate(zone_storage(i)%zone_end(0:nprocs-1))
      zone_storage(i)%zone_start(self) = zone_storage(i)%big_zones/nprocs*self+1
      zone_storage(i)%zone_end(self) = zone_storage(i)%zone_start(self)+zone_storage(i)%big_zones/nprocs-1
      if (zone_storage(i)%zone_end(self)>zone_storage(i)%big_zones) zone_storage(i)%zone_end(self) = zone_storage(i)%big_zones
   end do

   do i = 0,lvl
      write(*,*) 'lower check 0',self,'lower',zone_storage(i)%lower
      write(*,*) 'higher check 0',self,'high',zone_storage(i)%higher
      write(*,*) 'lo_corner check 0',self,'lo_corner',zone_storage(i)%lo_corner
      write(*,*) 'hi_corner check 0',self,'hi_corner',zone_storage(i)%hi_corner
      write(*,*) 'big_zones check 0',self,'big_zones',zone_storage(i)%big_zones
      write(*,*) 'zone start/end 0',self,'lvl',i,zone_storage(i)%zone_start,zone_storage(i)%zone_end
   end do
   !
   ! Agglomerate the appropriate data to processor 0 using non-blocking receives
   ! and blocking sends
   !
   do i = 0,lvl
      do a = 0,nprocs-1
         call mpi_bcast(zone_storage(i)%zone_start(a),1,&
            MPI_INT,a,MPI_COMM_WORLD,ier)
         call mpi_bcast(zone_storage(i)%zone_end(a),1,&
            MPI_INT,a,MPI_COMM_WORLD,ier)
      end do
   end do

   call MPI_BARRIER(MPI_COMM_WORLD,ier)

   counter = 0
   do i = 0,lvl
      n_elems = 3*zone_storage(i)%big_zones
      write(*,*) 'number of elements',n_elems
      if (self == 0) then
         do a = 1,nprocs-1
            do c = zone_storage(i)%zone_start(a),zone_storage(i)%zone_end(a)
               tag = c*100000+a*1000+1!+d*10
               call mpi_irecv(zone_storage(i)%lower(1:3,c),3,MPI_INT,a,&
                  tag,MPI_COMM_WORLD,req,ier)
               tag = tag + 1
               call mpi_irecv(zone_storage(i)%higher(1:3,c),3,MPI_INT,a,&
                  tag,MPI_COMM_WORLD,req,ier)
               tag = tag + 1
               call mpi_irecv(zone_storage(i)%lo_corner(1:3,c),3,MPI_DOUBLE_PRECISION,a,&
                  tag,MPI_COMM_WORLD,req,ier)
               tag = tag + 1
               call mpi_irecv(zone_storage(i)%hi_corner(1:3,c),3,MPI_DOUBLE_PRECISION,a,&
                  tag,MPI_COMM_WORLD,req,ier)
            end do
         end do
      else
         do b = zone_storage(i)%zone_start(self),zone_storage(i)%zone_end(self)
            tag = b*100000+self*1000+1!+d*10
            call mpi_send(zone_storage(i)%lower(1:3,b),3,MPI_INT,0,&
               tag,MPI_COMM_WORLD,ier)
            tag = tag + 1
            call mpi_send(zone_storage(i)%higher(1:3,b),3,MPI_INT,0,&
               tag,MPI_COMM_WORLD,ier)
            tag = tag + 1
            call mpi_send(zone_storage(i)%lo_corner(1:3,b),3,MPI_DOUBLE_PRECISION,0,&
               tag,MPI_COMM_WORLD,ier)
            tag = tag + 1
            call mpi_send(zone_storage(i)%hi_corner(1:3,b),3,MPI_DOUBLE_PRECISION,0,&
               tag,MPI_COMM_WORLD,ier)
         end do
      end if
   end do
   write(*,*) 'spack'
   !
   call mpi_barrier(MPI_COMM_WORLD,ier)

   do i = 0,lvl
      write(*,*) 'lower check 1',self,'lower',zone_storage(i)%lower
      write(*,*) 'higher check 1',self,'high',zone_storage(i)%higher
      write(*,*) 'lo_corner check 1',self,'lo_corner',zone_storage(i)%lo_corner
      write(*,*) 'hi_corner check 1',self,'hi_corner',zone_storage(i)%hi_corner
      write(*,*) 'big_zones check 1',self,'big_zones',zone_storage(i)%big_zones
      write(*,*) 'zone start/end 1',self,'lvl',i,zone_storage(i)%zone_start,zone_storage(i)%zone_end
   end do
   !
   ! Send all the data out to all the processors
   !
   do i = 0,lvl
      n_elems = 3*zone_storage(i)%big_zones
      req = 1
      call mpi_ibcast(zone_storage(i)%lower,n_elems,MPI_INT,&
         0,MPI_COMM_WORLD,req,ier)
      call mpi_wait(req,stat,ier)
      write(*,*) 'spiffy'
      req = 2
      call mpi_ibcast(zone_storage(i)%higher,n_elems,MPI_INT,&
         0,MPI_COMM_WORLD,req,ier)
      call mpi_wait(req,stat,ier)
      req = 3
      call mpi_ibcast(zone_storage(i)%lo_corner,n_elems,MPI_DOUBLE_PRECISION,&
         0,MPI_COMM_WORLD,req,ier)
      call mpi_wait(req,stat,ier)
      req = 4
      call mpi_ibcast(zone_storage(i)%hi_corner,n_elems,MPI_DOUBLE_PRECISION,&
         0,MPI_COMM_WORLD,req,ier)
      call mpi_wait(req,stat,ier)
      call mpi_barrier(MPI_COMM_WORLD,ier)
   end do

   write(*,*) 'lower check 2',self,'lower',zone_storage(lvl)%lower
   write(*,*) 'higher check 2',self,'high',zone_storage(lvl)%higher
   write(*,*) 'lo_corner check ',self,'lo_corner',zone_storage(lvl)%lo_corner
   write(*,*) 'hi_corner check ',self,'hi_corner',zone_storage(lvl)%hi_corner
   write(*,*) 'big_zones check ',self,'big_zones',zone_storage(lvl)%big_zones

   call MPI_FINALIZE(ier)
end program
...
As I said, this code works, but the larger version does not always work. OpenMPI throws several warnings akin to this:
mca: base: component_find: ess "mca_ess_pmi" uses an MCA interface that is not recognized (component MCA v2.1.0 != supported MCA v2.0.0) -- ignored
mca: base: component_find: grpcomm "mca_grpcomm_direct" uses an MCA interface that is not recognized (component MCA v2.1.0 != supported MCA v2.0.0) -- ignored
mca: base: component_find: rcache "mca_rcache_grdma" uses an MCA interface that is not recognized (component MCA v2.1.0 != supported MCA v2.0.0) -- ignored
etc. etc. But the program can still complete even with those warnings.
-Is there a way to ensure that MPI_bcast has emptied its buffer into the correct region of memory before moving on? It seems to miss this sometimes.
-Is there a different/better method to distribute the data? The sizes have to be able to vary unlike the test program.
Thank you ahead of time.
The most straightforward answer was to use MPI_Allgatherv. Much as I didn't want to mess with displacements, it was the best setup to share the information, and it reduced the overall code length.
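For illustration, a minimal sketch of that MPI_Allgatherv idea applied to one of the arrays (zone_storage(i)%lower); the counts/displs names are mine, not from the program above. With MPI_IN_PLACE, each rank's own zones, which already sit at the right offsets in its copy of the array, are left untouched while everyone receives everyone else's:
integer :: ierr, a
integer, allocatable :: counts(:), displs(:)

allocate(counts(0:nprocs-1), displs(0:nprocs-1))
do a = 0, nprocs-1
   ! Three integers per zone; rank a owns zones zone_start(a)..zone_end(a).
   counts(a) = 3*(zone_storage(i)%zone_end(a) - zone_storage(i)%zone_start(a) + 1)
   displs(a) = 3*(zone_storage(i)%zone_start(a) - 1)
end do
call MPI_ALLGATHERV(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, &
                    zone_storage(i)%lower, counts, displs, MPI_INTEGER, &
                    MPI_COMM_WORLD, ierr)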
I believe an MPI_Waitall solution would work too, as the data was not being fully received before being broadcast.
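A sketch of that MPI_Waitall variant, assuming an upper bound of four requests per zone (one per irecv in the loop above); reqs, stats and nreq are illustrative names:
integer :: nreq
integer, allocatable :: reqs(:)
integer, allocatable :: stats(:,:)

allocate(reqs(4*zone_storage(i)%big_zones))
allocate(stats(MPI_STATUS_SIZE, 4*zone_storage(i)%big_zones))
nreq = 0
! Inside the receive loops, give every mpi_irecv its own handle
! instead of overwriting the single req:
nreq = nreq + 1
call mpi_irecv(zone_storage(i)%lower(1:3,c), 3, MPI_INTEGER, a, &
               tag, MPI_COMM_WORLD, reqs(nreq), ier)
! ... same for higher, lo_corner and hi_corner ...
! Then, after the loops and before touching the data:
call mpi_waitall(nreq, reqs, stats, ier)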

How to read all the items from the Channel when you don't know their actual count

I'm trying to implement a crawler that visits some URL, collects new relative URLs from it and builds a report. I'm trying to do it concurrently using Crystal fibers and channels, like the following:
urls = [...] # of String
visited_urls = []

pool_size.times do
  spawn do
    loop do
      url = urls.shift?
      break if url.nil?
      channel.send(url) if some_condition
    end
  end
end

# TODO: here the problem!
loop do
  url = channel.receive?
  break if url.nil? || channel.closed?
  visited_urls << url
end

puts visited_urls.inspect
But here I have a problem: an infinite second loop (it calls channel.receive? until the last item in the channel, then waits for a new message that never arrives). The issue exists because I never know how many items are actually in the channel, so I can't do it like proposed in the Concurrency section of the Crystal language Guides.
So maybe there are some good practices for working with a channel when we don't know how many items it will store and we need to receive them all? Thanks!
A common solution to this is to have a kill value. Either as part of the main data flow like this:
results = Channel(String|Symbol).new(POOL_SIZE * 2)

POOL_SIZE.times do
  spawn do
    while has_work?
      results.send "some work result"
    end
    results.send :done
  end
end

done_workers = 0
loop do
  message = results.receive
  if message == :done
    done_workers += 1
    break if done_workers == POOL_SIZE
  elsif message.is_a? String
    puts "Got: #{message}"
  end
end
Or via a secondary channel to signal the event:
results = Channel(String).new(POOL_SIZE * 2)
done = Channel(Nil).new(POOL_SIZE)

POOL_SIZE.times do
  spawn do
    while has_work?
      results.send "some work result"
    end
    done.send nil
  end
end

done_workers = 0
loop do
  select
  when message = results.receive
    puts "Got: #{message}"
  when done.receive
    done_workers += 1
    break if done_workers == POOL_SIZE
  end
end

How do I diagnose a bus error in Fortran

I'm learning how to use Fortran to do some data analysis. I'm working through the following example:
program linalg
   implicit none
   real :: v1(3), v2(3), m(3,3)
   integer :: i,j

   v1(1) = 0.25
   v1(2) = 1.2
   v1(3) = 0.2

   ! use nested do loops to initialise the matrix
   ! to the unit matrix
   do i=1,3
      do j=1,3
         m(i,j) = 0.0
      end do
      m(i,j) = 1.0
   end do

   ! do a matrix multiplication of a vector, equivalent to v2i = mij v1j
   do i = 1,3
      v2(i) = 0.0
      do j = 1,3
         v2(i) = v2(i) + m(i,j)*v1(j)
      end do
   end do

   write(*,*) 'v2 = ', v2
end program linalg
which I execute with
f95 -o linalg linalg.f90
./linalg
However, I get the following message in the terminal:
Bus error
Some links that I've followed online suggest that this has to do with not having pre-defined a variable, but I am sure that I have in this program and cannot find where the error is coming from. Is there another reason I would be getting this error?
Your mistake is in here
do i=1,3
   do j=1,3
      m(i,j) = 0.0
   end do
   m(i,j) = 1.0 ! here be a dragon
end do
Fortran is explicit in stating that after the end of a loop the value of the index variable is 1 greater than the value it had on the last iteration of the loop. So in this case the statement m(i,j) = 1.0 will try to address m(1,4) at the first go round, then m(2,4), and so forth.
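A tiny self-contained demonstration of that rule (my example, for a stride of 1):
program loopvar
   implicit none
   integer :: i
   do i = 1, 3
   end do
   print *, i   ! prints 4: one step past the final iteration
end program loopvar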
Sometimes you get 'lucky' with an attempt to write outside the bounds of an array and the write stays inside the address space of the process you are working in. 'Lucky' in the sense that your program is wrong but doesn't crash -- this crash is a much better situation to be in. The bus error suggests that the compiler has generated an address to write to that lies in forbidden territory for any process.
You could have found this yourself by turning on 'run-time bounds checking' with your compiler. Your compiler's documentation, or other Qs and As here on SO, will tell you how to do that.
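For instance, if your f95 is actually gfortran, something like this would have trapped the out-of-bounds write with a clear message (flag names vary between compilers, so check yours):
f95 -fcheck=bounds -o linalg linalg.f90
./linalg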
I'll leave it to you to fix this as you wish, you show every sign of being able to figure it out now you know the rules.

How to open and read multiple files in Fortran 90

I have some issues with opening and reading multiple files. I have to write code which reads two columns in n files formatted in the same way (they differ only in their values...). Before this, I open another input file and an output file in which I will write my results. I read other questions on this forum (such as this one) and tried to do the same thing, but I receive these errors:
read(fileinp,'(I5)') i-49
1
devstan.f90:20.24:
fileLoop : do i = 50,52
2
Error: Variable 'i' at (1) cannot be redefined inside loop beginning at (2)
and
read(fileinp,'(I5)') i-49
1
Error: Invalid character in name at (1)
My files are numbered from 1 to n and are named 'lin*27-n.dat' (where n is the index, starting from 1), and the code is:
program deviation
   implicit none
   character(len=15) :: filein,fileout,fileinp
   integer :: row,i,h
   real :: usv,usf,tsv,tsf,diff

   write(*,'(2x,''Input file .......''/)')
   read(*,'(a12)') filein
   write(*,'(2x,''Output file........''/)')
   read(*,'(a12)') fileout
   open(unit = 30,File=filein)
   open(unit = 20,File=fileout)

   fileLoop : do i = 50,52
      fileinp = 'lin*27-'
      read(fileinp,'(I5)') i-49
      open(unit = i,File=fileinp)
      do row = 1,24
         read(30,*) h,usv,tsv
         read(i,*) h,usf,tsf
         diff = usf - usv
         write(20,*) diff
      enddo
      close(i)
   enddo fileLoop
end program deviation
How can I solve this? I am not a pro in Fortran, so please don't use difficult language, thanks.
The troublesome line is
read(fileinp,'(I5)') i-49
You surely mean to do a write (as in the example linked): this read statement attempts to read from the variable fileinp rather than writing to it.
That said, simply replacing with write is probably not what you need either. This will ignore the previous line
fileinp = 'lin*27-'
merely setting it to, in turn, "1", "2", "3" (with leading blanks). Something like (assuming you intend that * to be there)
write(fileinp, '("lin*27-",I1)') i-49
Note also the use of I1 in the format, rather than I5: one may want to avoid blanks in the filename. [This is suitable when there is exactly one digit; look up Iw.m and I0 when generalizing.]
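Putting it together, a minimal sketch of the file loop (assuming the '.dat' suffix from your file names is wanted; the unit numbers and range are taken from your code):
do i = 50, 52
   ! i-49 runs over 1, 2, 3; I0 uses the minimal width, so no blanks
   write(fileinp, '("lin*27-", I0, ".dat")') i - 49
   open(unit = i, file = fileinp, status = 'old')
   ! ... read from unit i as before ...
   close(i)
end do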

New to Fortran, questions about writing to file

I am completely new to Fortran and pretty new to programming in general. I am trying to compile a script someone else has written. This is giving me a few problems. The top half of the code is:
C
C     Open direct-access output file ('JPLEPH')
C
      OPEN ( UNIT = 12,
     .       FILE = 'JPLEPH',
     .       ACCESS = 'DIRECT',
     .       FORM = 'UNFORMATTED',
     .       RECL = IRECSZ,
     .       STATUS = 'NEW' )
C
C     Read and write the ephemeris data records (GROUP 1070).
C
      CALL NXTGRP ( HEADER )

      IF ( HEADER .NE. 'GROUP 1070' ) CALL ERRPRT(1070,'NOT HEADER')

      NROUT = 0
      IN = 0
      OUT = 0

    1 READ(*,'(2i6)')NRW,NCOEFF
      if(NRW .EQ. 0) GO TO 1
      READ (*,'(3D26.18)',IOSTAT =IN) (DB(K),K=1,NCOEFF)
      DO WHILE ( ( IN .EQ. 0 )
     .           .AND. ( DB(2) .LT. T2) )
         IF ( 2*NCOEFF .NE. KSIZE ) THEN
            CALL ERRPRT(NCOEFF,' 2*NCOEFF not equal to KSIZE')
         ENDIF
C
C        Skip this data block if the end of the interval is less
C        than the specified start time or if it does not begin
C        where the previous block ended.
C
         IF ( (DB(2) .GE. T1) .AND. (DB(1) .GE. DB2Z) ) THEN
            IF ( FIRST ) THEN
C
C              Don't worry about the intervals overlapping
C              or abutting if this is the first applicable
C              interval.
C
               DB2Z = DB(1)
               FIRST = .FALSE.
            ENDIF
            IF (DB(1) .NE. DB2Z ) THEN
C
C              Beginning of current interval is past the end
C              of the previous one.
               CALL ERRPRT (NRW, 'Records do not overlap or abut')
            ENDIF
            DB2Z = DB(2)
            NROUT = NROUT + 1
            print*,'Out =', OUT
            WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K),K=1,NCOEFF)
            print*,'Out2 =', OUT
            IF ( OUT .NE. 0 ) THEN
               CALL ERRPRT (NROUT,
     .            'th record not written because of error')
            ENDIF
So, when I print "Out" and "Out2" to the screen, I find that Out=0 and Out2=110. As it is no longer equal to zero, the program gives me an error. Therefore I am basically wondering about what is happening here:
WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K),K=1,NCOEFF)
I assume that 12 refers to the file I have opened (and created) and want to write to. What does the rest of the first bracket do? And what is the point of the second? Does that give me the information I want to put in my file? In that case, where does DB get filled with that?
Generally I am wondering what is going wrong? Why does OUT change value? (I need t
NCOEFF is defined as an Integer in the beginning of the programme, and DB: as DOUBLE PRECISION DB(3000), DB2Z/0.d0/ , so I assume DB is an array of some sort.
In this program, OUT is telling if the write statement was successful or not. (the IOSTAT parameter to the write statement means "I/O status", or input/output status). It returns 0 if the I/O operation was a success, or the number of the error code otherwise. You can find what the error codes mean here.
I'm not familiar with the REC parameter, but a starting place to investigate yourself can be found here.
To quote the handbook, REC indicates the record number to be read or written. As advised, see the documentation which accompanies your compiler for further explanation.
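As a small self-contained illustration (mine, not from the program above), here is a direct-access write with IOSTAT checked. Note that the units of RECL (bytes or words) are compiler dependent, so RECL = 80 for ten 8-byte values is only illustrative:
      DOUBLE PRECISION BUF(10)
      INTEGER IOS, K
      DO 5 K = 1, 10
         BUF(K) = K * 1.0D0
    5 CONTINUE
      OPEN ( UNIT = 12, FILE = 'SCRATCH', ACCESS = 'DIRECT',
     .       FORM = 'UNFORMATTED', RECL = 80, STATUS = 'REPLACE' )
      WRITE ( 12, REC = 1, IOSTAT = IOS ) ( BUF(K), K = 1, 10 )
      IF ( IOS .NE. 0 ) PRINT *, 'WRITE FAILED, IOSTAT = ', IOS
      CLOSE ( 12 )
      END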
(DB(K),K=1,NCOEFF) means 'all the elements in DB from 1 to NCOEFF'. You are looking at an io-implied-do statement.
The statement
WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K),K=1,NCOEFF)
is, if memory serves me, called an "implied DO-loop". As written, it will write NCOEFF values from array DB, starting at DB(1).
It is called an implied DO-loop because the explicit form would be (in FORTRAN IV, for the ancients: I know it a lot better than the more modern variations) something along the lines of:
      DO 10 K=1,NCOEFF
         WRITE (12,REC=NROUT+2,IOSTAT=OUT) DB(K)
   10 CONTINUE
This is a DO-loop. The implied DO-loop form lets you put the "loop" right in the input/output statement. (Strictly speaking, the two forms are not equivalent for this direct-access file: the explicit loop executes a separate WRITE for each element, and each of those writes starts record NROUT+2 afresh, so only the last value would survive. The implied DO-loop transfers all NCOEFF values in a single record, which is the point of using it here.)
What makes it useful is that you can have multiple arrays, and multiple loops. For a simple example:
WRITE (12,REC=NROUT+2,IOSTAT=OUT) (DB(K), DC(K), K=1,NCOEFF)
110 is the error code thrown by the WRITE call. You need to check your FORTRAN RTL (run-time library) reference. It should list the possible error codes. I think 110 means that you're trying to convert a double-precision value to an integer, but the value is bigger than you can store in an integer. Maybe dump the values in DB and see.