I am creating an MPI program in which I am trying to send the same data to all processes as soon as they finish their calculation. The processes can have large differences in their computation time, so I don't want one processor to have to wait for another.
The root process is guaranteed to always send first.
I know that MPI_Bcast acts as a barrier, so I experimented with MPI_IBCAST:
program main
  use mpi
  implicit none
  integer rank, nprcos, ierror, a(10), req

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprcos, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  a = rank

  if (rank /= 2) then
    call MPI_IBCAST(a, size(a), MPI_INTEGER, 0, MPI_COMM_WORLD, req, ierror)
    call MPI_WAIT(req, MPI_STATUS_IGNORE, ierror)
  endif

  write (*,*) 'Hello World from process: ', rank, 'of ', nprcos, "a = ", a(1)

  call MPI_FINALIZE(ierror)
end program main
From my experiments it seems that, regardless of which rank is "boycotting" the MPI_IBCAST, it always works on all the others:
> $ mpifort test.f90 && mpiexec --mca btl tcp,self -np 4 ./a.out
Hello World from process: 2 of 4 a = 2
Hello World from process: 1 of 4 a = 0
Hello World from process: 0 of 4 a = 0
Hello World from process: 3 of 4 a = 0
Is this guaranteed behavior, or is it just specific to my OpenMPI implementation? How else could I implement this? I can only think of a loop of MPI_ISENDs.
No, this is not guaranteed; all ranks in the communicator have to participate. Within MPI, that is the definition of a collective communication.
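If you need each rank to pick up the data only when it is ready, the loop of point-to-point non-blocking calls you mention does exactly that: the root posts one MPI_ISEND per receiver up front, and each receiver posts a single MPI_IRECV that it completes whenever its own computation is done, without waiting for anybody else. A minimal sketch of that idea (not tested against your actual program) could look like this:

  program isend_loop
    use mpi
    implicit none
    integer :: rank, nprocs, ierror, i, req, a(10)
    integer, allocatable :: reqs(:)

    call MPI_INIT(ierror)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

    a = rank
    if (rank == 0) then
      ! Root sends the same buffer to every other rank without waiting for them.
      allocate(reqs(nprocs-1))
      do i = 1, nprocs-1
        call MPI_ISEND(a, size(a), MPI_INTEGER, i, 0, MPI_COMM_WORLD, reqs(i), ierror)
      end do
      call MPI_WAITALL(nprocs-1, reqs, MPI_STATUSES_IGNORE, ierror)
    else
      ! Each receiver completes the receive only when it is ready;
      ! the matching send has already been posted by the root.
      call MPI_IRECV(a, size(a), MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierror)
      ! ... long local computation here ...
      call MPI_WAIT(req, MPI_STATUS_IGNORE, ierror)
    end if

    write (*,*) 'rank ', rank, ' has a(1) = ', a(1)
    call MPI_FINALIZE(ierror)
  end program isend_loop

Whether this is faster than having every rank call MPI_IBCAST and MPI_WAIT depends on the implementation, but unlike the broadcast it does not require every rank in the communicator to take part.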
I am using persistent communication in my CFD code. The communication setup lives in another subroutine, and in the main subroutine, which contains the do loop, I call MPI_STARTALL() and MPI_WAITALL().
To keep it short, I am showing only the first part of the setup; the rest of the arrays are handled exactly the same way.
My setup subroutine looks like:
Subroutine MPI_Subroutine
  use Variables
  use mpi
  implicit none

  !Starting up MPI
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,npes,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,MyRank,ierr)

  !Compute the size of local block (1D Decomposition)
  Jmax = JmaxGlobal
  Imax = ImaxGlobal/npes
  if (MyRank.lt.(ImaxGlobal - npes*Imax)) then
    Imax = Imax + 1
  end if
  if (MyRank.ne.0.and.MyRank.ne.(npes-1)) then
    Imax = Imax + 2
  Else
    Imax = Imax + 1
  endif

  ! Computing neighbors
  if (MyRank.eq.0) then
    Left = MPI_PROC_NULL
  else
    Left = MyRank - 1
  end if
  if (MyRank.eq.(npes -1)) then
    Right = MPI_PROC_NULL
  else
    Right = MyRank + 1
  end if

  ! Initializing the Arrays in each processor, according to the number of local nodes
  Call InitializeArrays

  !Creating the channel of communication for this computation,
  !Sending and receiving the u_old (Ghost cells)
  Call MPI_SEND_INIT(u_old(2,:),Jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(1),ierr)
  Call MPI_RECV_INIT(u_old(Imax,:),jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(2),ierr)
  Call MPI_SEND_INIT(u_old(Imax-1,:),Jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(3),ierr)
  Call MPI_RECV_INIT(u_old(1,:),jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(4),ierr)
Since I am debugging my code, I am just checking these arrays. When I check them, my ghost cells are full of zeroes, so I guess I am messing up the instructions.
The main code, where I call MPI_STARTALL and MPI_WAITALL, looks like:
Program
  use Variables
  use mpi
  implicit none

  open(32, file = 'error.dat')

  Call MPI_Subroutine

  !kk=kk+1
  DO kk=1, 2001
    ! A lot of calculation

    ! communicating the maximum error among the processes and delta t
    call MPI_REDUCE(eps,epsGlobal,1,MPI_DOUBLE_PRECISION,MPI_MAX,0,MPI_COMM_WORLD,ierr)
    call MPI_BCAST(epsGlobal,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)

    call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_DOUBLE_PRECISION,MPI_MIN,0,MPI_COMM_WORLD,ierr)
    if(MyRank.eq.0) delta_t = delta_tGlobal
    call MPI_BCAST(delta_t,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)

    if(MyRank.eq.0) then
      write(*,*) kk,epsGlobal,(kk*delta_t)
      write(32,*) kk,epsGlobal
    endif

    Call Swap

    Call MPI_STARTALL(4,req,ierr) !
    Call MPI_WAITALL(4,req,status,ierr)
  enddo
The variables are set in another module. The MPI-related variables look like:
! MPI variables
INTEGER :: npes, MyRank, ierr, Left, Right, tag
INTEGER :: status(MPI_STATUS_SIZE,4)
INTEGER,dimension(4) :: req
I appreciate your time and suggestions on this problem.
I have a simple program, which is supposed to gather a number of small arrays into one big one using MPI.
PROGRAM main
  include 'mpif.h'
  integer ierr, i, myrank, thefile, n_procs
  integer, parameter :: BUFSIZE = 3
  complex*16, allocatable :: loc_arr(:), glob_arr(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, n_procs, ierr)

  allocate(loc_arr(BUFSIZE))
  loc_arr = 0.7 * myrank - cmplx(0.3, 0, kind=8)

  allocate(glob_arr(n_procs* BUFSIZE))
  write (*,*) myrank, shape(glob_arr)

  call MPI_Gather(loc_arr, BUFSIZE, MPI_DOUBLE_COMPLEX,&
                  glob_arr, n_procs * BUFSIZE, MPI_DOUBLE_COMPLEX,&
                  0, MPI_COMM_WORLD, ierr)

  write (*,*) myrank,"Errorcode:" , ierr
  call MPI_FINALIZE(ierr)
END PROGRAM main
I have some experience with MPI in C, but for Fortran 90 nothing seems to work. Here is how I compile (I use ifort) and run it:
mpif90 test.f90 -check all && mpirun -np 4 ./a.out
1 12
3 12
3 Errorcode: 0
1 Errorcode: 0
0 12
2 12
2 Errorcode: 0
0 Errorcode: 0
*** Error in `./a.out': free(): invalid pointer: 0x0000000000a25790 ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 10889 RUNNING AT LenovoX1kabel
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
What do I do wrong? Sometimes I get this pointer problem, sometimes I get a segmentation fault, but to me it doesn't look like any of the ifort checks complain.
All the Errorcodes are 0, so I'm not sure where I go wrong.
You should never specify the number of processes in the count arguments of MPI collectives. That is a simple rule of thumb.
Therefore the recvcount of n_procs * BUFSIZE is clearly wrong.
And indeed the manual states: recvcount — number of elements for any single receive (integer, significant only at root).
You should just use BUFSIZE. This is the same in C and Fortran.
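In other words, only the recvcount changes; the corrected call would look like this:

  call MPI_Gather(loc_arr, BUFSIZE, MPI_DOUBLE_COMPLEX,&
                  glob_arr, BUFSIZE, MPI_DOUBLE_COMPLEX,&
                  0, MPI_COMM_WORLD, ierr)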
I am trying to reproduce this C example in Fortran. My code so far:
use mpi
implicit none

integer, parameter :: maxn = 8
integer, allocatable :: xlocal(:,:)
integer :: i, j, lsize, errcnt, toterr, buff
integer :: ierror, nproc, pid, root = 0, nreq = 0
integer, allocatable :: request(:), status(:,:)

call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierror)

if (mod(maxn, nproc) /= 0) then
  write(*,*) 'Array size (maxn) should be a multiple of the number of processes'
  call MPI_ABORT(MPI_COMM_WORLD, 1, ierror)
end if

lsize = maxn/nproc

allocate(xlocal(0:lsize+1, maxn))
allocate(request(nproc))
allocate(status(MPI_STATUS_SIZE,nproc))

xlocal(0,:) = -1
xlocal(1:lsize,:) = pid
xlocal(lsize+1,:) = -1

! send down unless on bottom
if (pid < nproc-1) then
  nreq = nreq + 1
  call MPI_ISEND(xlocal(lsize,:), maxn, MPI_INTEGER, &
                 pid+1, 0, MPI_COMM_WORLD, request(nreq), ierror)
  write(*,'(2(A,I1),A)') 'process ', pid, ' sent to process ', pid+1, ':'
  write(*,*) xlocal(lsize,:)
end if

if (pid > 0) then
  nreq = nreq + 1
  call MPI_IRECV(xlocal(0,:), maxn, MPI_INTEGER, &
                 pid-1, 0, MPI_COMM_WORLD, request(nreq), ierror)
  write(*,'(2(A,I1),A)') 'process ', pid, ' received from process ', pid-1, ':'
  write(*,*) xlocal(0,:)
end if

! send up unless on top
if (pid > 0) then
  nreq = nreq + 1
  call MPI_ISEND(xlocal(1,:), maxn, MPI_INTEGER, &
                 pid-1, 1, MPI_COMM_WORLD, request(nreq), ierror)
  write(*,'(2(A,I1),A)') 'process ', pid, ' sent to process ', pid-1, ':'
  write(*,*) xlocal(1,:)
end if

if (pid < nproc-1) then
  nreq = nreq + 1
  call MPI_IRECV(xlocal(lsize+1,:), maxn, MPI_INTEGER, &
                 pid+1, 1, MPI_COMM_WORLD, request(nreq), ierror)
  write(*,'(2(A,I1),A)') 'process ', pid, ' received from process ', pid+1, ':'
  write(*,*) xlocal(lsize+1,:)
end if

call MPI_WAITALL(nreq, request, status, ierror)

! check results
errcnt = 0
do i = 1, lsize
  do j = 1, maxn
    if (xlocal(i,j) /= pid) errcnt = errcnt + 1
  end do
end do
do j = 1, maxn
  if (xlocal(0,j) /= pid-1) errcnt = errcnt + 1
  if ((pid < nproc-1) .and. (xlocal(lsize+1,j) /= pid+1)) errcnt = errcnt + 1
end do

call MPI_REDUCE(errcnt, toterr, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD)

if (pid == root) then
  if (toterr == 0) then
    write(*,*) "no errors found"
  else
    write(*,*) "found ", toterr, " errors"
  end if
end if

deallocate(xlocal)
deallocate(request)
deallocate(status)

call MPI_FINALIZE(ierror)
But I am running into segmentation faults and cannot figure out why. I have a feeling it is due to the request array. Can someone explain the correct way of using the request array in Fortran? None of the references I found clarify this.
Thanks in advance.
In case you haven't already done so, consider compiling your program with some flags that will help you in debugging, e.g. with gfortran, you can use -O0 -g -fbounds-check (if that does not help, you might add -fsanitize=address for versions >= 4.8). Other compilers have similar options for debugging.
Doing that, and running with 2 processes, your program crashes at the MPI_Reduce line. If you look up the specifications (e.g. OpenMPI 1.8) you can see that this subroutine requires one more argument, i.e., you forgot to add the ierror argument at the end.
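With the missing argument added, the call becomes:

  call MPI_REDUCE(errcnt, toterr, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierror)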
It is a bit tragic that even though the subprograms from the mpi module are accessible through use association, and thus should be checked for argument consistency to avoid these trivial errors, not all subprograms are necessarily in that module. I don't know which MPI implementation you use, but I checked my local MPICH installation and it does not have most subroutines in the module, so no explicit interfaces exist for them. I suspect you are in a similar situation, and other implementations might suffer the same fate. You could compare it to the C header file missing the function prototype for MPI_Reduce. I guess the reason for this is that originally there was only a Fortran 77 interface for most implementations.
Some final comments: be careful not to just copy-paste the C code. The array sections you pass (e.g. xlocal(lsize,:)) are not contiguous, so a temporary copy is passed to the MPI routines, which is very inefficient (not that it really matters in this case).
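If you want to avoid those temporary copies, one option (a sketch only, assuming you are free to change the layout) is to swap the two indices of xlocal so that each halo slice becomes a contiguous column, since Fortran stores arrays column-major:

  ! Layout sketch: store the decomposed direction as the second index,
  ! so each halo slice xlocal(:,k) is one contiguous block of maxn integers.
  allocate(xlocal(maxn, 0:lsize+1))      ! instead of xlocal(0:lsize+1, maxn)

  ! the first exchange then passes a contiguous section:
  call MPI_ISEND(xlocal(:,lsize), maxn, MPI_INTEGER, &
                 pid+1, 0, MPI_COMM_WORLD, request(nreq), ierror)

The initialization and checking loops would of course need their indices swapped accordingly.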
I want to create a parallel program, which makes heavy use of SCALAPACK. The basis of SCALAPACK is BLACS, which itself relies on MPI for interprocess communication.
I want to start the program with a defined number of processes (e.g. the number of cores on the machine) and let the algorithm decide, how to use these processes for calculations.
As a test case I wanted to use 10 processes. 9 of these processes should get arranged in a grid (BLACS_GRIDINIT) and the 10th process should wait until the other processes are finished.
Unfortunately, OpenMPI crashes because the last process doesn't get into an MPI context from BLACS, while the others do.
Question: What is the correct way to use BLACS with more processes than needed?
I did some experiments with additional MPI_INIT and MPI_FINALIZE calls, but none of my attempts were successful.
I started with the sample code from Intel MKL (shortened a little bit):
      PROGRAM HELLO
*     -- BLACS example code --
*     Written by Clint Whaley 7/26/94
*     Performs a simple check-in type hello world
*     ..
*     .. External Functions ..
      INTEGER BLACS_PNUM
      EXTERNAL BLACS_PNUM
*     ..
*     .. Variable Declaration ..
      INTEGER CONTXT, IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL
      INTEGER ICALLER, I, J, HISROW, HISCOL
*     Determine my process number and the number of processes in
*     machine
      CALL BLACS_PINFO(IAM, NPROCS)
*     Set up process grid that is as close to square as possible
      NPROW = INT( SQRT( REAL(NPROCS) ) )
      NPCOL = NPROCS / NPROW
*     Get default system context, and define grid
      CALL BLACS_GET(0, 0, CONTXT)
      CALL BLACS_GRIDINIT(CONTXT, 'Row', NPROW, NPCOL)
      CALL BLACS_GRIDINFO(CONTXT, NPROW, NPCOL, MYPROW, MYPCOL)
*     If I'm not in grid, go to end of program
      IF ( (MYPROW.GE.NPROW) .OR. (MYPCOL.GE.NPCOL) ) GOTO 30
*     Get my process ID from my grid coordinates
      ICALLER = BLACS_PNUM(CONTXT, MYPROW, MYPCOL)
*     If I am process {0,0}, receive check-in messages from
*     all nodes
      IF ( (MYPROW.EQ.0) .AND. (MYPCOL.EQ.0) ) THEN
         WRITE(*,*) ' '
         DO 20 I = 0, NPROW-1
            DO 10 J = 0, NPCOL-1
               IF ( (I.NE.0) .OR. (J.NE.0) ) THEN
                  CALL IGERV2D(CONTXT, 1, 1, ICALLER, 1, I, J)
               END IF
*              Make sure ICALLER is where we think in process grid
               CALL BLACS_PCOORD(CONTXT, ICALLER, HISROW, HISCOL)
               IF ( (HISROW.NE.I) .OR. (HISCOL.NE.J) ) THEN
                  WRITE(*,*) 'Grid error! Halting . . .'
                  STOP
               END IF
               WRITE(*, 3000) I, J, ICALLER
   10       CONTINUE
   20    CONTINUE
         WRITE(*,*) ' '
         WRITE(*,*) 'All processes checked in. Run finished.'
*     All processes but {0,0} send process ID as a check-in
      ELSE
         CALL IGESD2D(CONTXT, 1, 1, ICALLER, 1, 0, 0)
      END IF
   30 CONTINUE
      CALL BLACS_EXIT(0)
 1000 FORMAT('How many processes in machine?')
 2000 FORMAT(I)
 3000 FORMAT('Process {',i2,',',i2,'} (node number =',I,
     $        ') has checked in.')
      STOP
      END
Update: I investigated the source code of BLACS to see what happens there.
The call to BLACS_PINFO initializes the MPI context with MPI_INIT if this hasn't happened before. This means that at this point everything works as expected.
At the end, the call to BLACS_EXIT(0) should free all resources from BLACS and if the argument is 0, it should also call MPI_FINALIZE. Unfortunately, this doesn't work as expected and my last process doesn't call MPI_FINALIZE.
As a workaround, one could query MPI_FINALIZED and call MPI_FINALIZE only if necessary.
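In Fortran that workaround could look roughly like this (an untested sketch):

      LOGICAL FINALIZED
      INTEGER IERR

      CALL BLACS_EXIT(0)
      CALL MPI_FINALIZED(FINALIZED, IERR)
      IF (.NOT. FINALIZED) CALL MPI_FINALIZE(IERR)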
Update 2: My previous tries were done with Intel Studio 2013.0.079 and OpenMPI 1.6.2 on SUSE Linux Enterprise Server 11.
After reading ctheo's answer, I tried to compile this example with the tools given by Ubuntu 12.04 (gfortran 4.6.3, OpenMPI 1.4.3, BLACS 1.1) and was successful.
My conclusion is that Intel's implementation appears to be buggy. I will retry this example in the not too distant future with the newest service release of Intel Studio, but I don't expect any changes.
However, I would appreciate any other (and maybe better) solution.
I don't know the answer, and I would hazard a guess that the number of people who both participate on SO and know the answer to your question is < 1. However, I'd suggest that you might have slightly better luck asking on scicomp or by contacting the ScaLAPACK team at the University of Tennessee directly through their support page. Good luck!
I don't think that you need to do much to use fewer processes in SCALAPACK.
The BLACS_PINFO subroutine returns the total number of processes.
If you want to use one less, just do NPROCS = NPROCS - 1.
I used your sample code (fixed some typos in the FORMAT statements), added the subtraction, and got the following output:
$ mpirun -n 4 ./a.out
Process { 0, 0} (node number = 0) has checked in.
Process { 0, 1} (node number = 1) has checked in.
Process { 0, 2} (node number = 2) has checked in.
All processes checked in. Run finished.
BLACS_GRIDINIT then creates a grid with the reduced NPROCS.
By calling BLACS_GRIDINFO, the one left-over process gets MYPROW = MYPCOL = -1, so it can simply skip the work, as sketched below.
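Relative to the original sample code, the change is essentially just these lines (sketch):

      CALL BLACS_PINFO(IAM, NPROCS)
      NPROCS = NPROCS - 1
      NPROW = INT( SQRT( REAL(NPROCS) ) )
      NPCOL = NPROCS / NPROW
      CALL BLACS_GET(0, 0, CONTXT)
      CALL BLACS_GRIDINIT(CONTXT, 'Row', NPROW, NPCOL)
      CALL BLACS_GRIDINFO(CONTXT, NPROW, NPCOL, MYPROW, MYPCOL)
*     The left-over process is not in the grid, gets MYPROW = MYPCOL = -1
*     and can simply skip to the end of the program.
      IF ( (MYPROW.LT.0) .OR. (MYPCOL.LT.0) ) GOTO 30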
On the other hand, if you want to create multiple grids that use different processes, then you should probably use the BLACS_GRIDMAP subroutine. The sample code below creates two equal grids, each using half of the total processes.
      PROGRAM HELLO
*     ..
      INTEGER CONTXT(2), IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL
      INTEGER ICALLER, I, J, HISROW, HISCOL
      integer UMAP(2,10,10)
*
      CALL BLACS_PINFO(IAM, NPROCS)
      NPROCS = NPROCS/2
*
      NPROW = INT( SQRT( REAL(NPROCS) ) )
      NPCOL = NPROCS / NPROW
*
      DO IP = 1, 2
         DO I = 1, NPROW
            DO J = 1, NPCOL
               UMAP(IP,I,J) = (IP-1)*NPROCS+(I-1)*NPCOL+(J-1)
            ENDDO
         ENDDO
         CALL BLACS_GET(0, 0, CONTXT(IP))
         CALL BLACS_GRIDMAP(CONTXT(IP), UMAP(IP,:,:), 10, NPROW, NPCOL )
      ENDDO
*
      DO IP = 1, 2
         CALL BLACS_GRIDINFO(CONTXT(IP), NPROW, NPCOL, MYPROW, MYPCOL)
         IF(MYPROW.GE.0 .AND. MYPCOL.GE.0 ) THEN
            WRITE(*,1000) IAM, MYPROW, MYPCOL, IP
         END IF
      ENDDO
      CALL BLACS_EXIT(0)
 1000 FORMAT('Process ',i2,' is (',i2,','i2 ') of grid ',i2)
*
      STOP
      END
I got the following output:
$ mpirun -n 8 ./a.out
Process 0 is ( 0, 0) of grid 1
Process 1 is ( 0, 1) of grid 1
Process 2 is ( 1, 0) of grid 1
Process 3 is ( 1, 1) of grid 1
Process 4 is ( 0, 0) of grid 2
Process 5 is ( 0, 1) of grid 2
Process 6 is ( 1, 0) of grid 2
Process 7 is ( 1, 1) of grid 2
I did not collect the data on process zero, so you will get this output only if all processes are local.