Catching a deadlock in a simple odd-even sending - Fortran

I'm trying to solve a simple problem with MPI; my implementation is MPICH2 and my code is in Fortran. I have used blocking send and receive. The idea is simple, but when I run it, it crashes! I have absolutely no idea what is wrong. Can anyone comment on this issue, please? Here is a piece of the code:
integer, parameter :: IM=100, JM=100
REAL, ALLOCATABLE :: T(:,:), TF(:,:)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,RNK,IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZ,IERR)
prv = rnk-1
nxt = rnk+1
LIM = INT(IM/SIZ)
IF (rnk==0) THEN
   ALLOCATE(TF(IM,JM))
   prv = MPI_PROC_NULL
ELSEIF (rnk==siz-1) THEN
   NXT = MPI_PROC_NULL
   LIM = LIM+MOD(IM,SIZ)
END IF
IF (MOD(RNK,2)==0) THEN
   CALL MPI_SEND(T(2,:),JM+2,MPI_REAL,PRV,10,MPI_COMM_WORLD,IERR)
   CALL MPI_RECV(T(1,:),JM+2,MPI_REAL,PRV,20,MPI_COMM_WORLD,STAT,IERR)
ELSE
   CALL MPI_RECV(T(LIM+2,:),JM+2,MPI_REAL,NXT,10,MPI_COMM_WORLD,STAT,IERR)
   CALL MPI_SEND(T(LIM+1,:),JM+2,MPI_REAL,NXT,20,MPI_COMM_WORLD,IERR)
END IF
As far as I understood, the even processes are not receiving anything, while the odd ones finish their sends successfully. In some cases, when I added some prints to observe what is going on, I saw that the variable NXT was changing during the sending procedure! For example, all the odd processes were sending messages to process 0, not to their next rank!

The array T is never allocated, so reading from or writing to it is an error.
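For illustration, an allocation consistent with the slices and counts used in the snippet might look like the following; the exact halo layout (two extra rows and JM+2 values per row) is my assumption, not something stated in the question:
! rows 1..LIM+2 and JM+2 values per row, as implied by the message counts above
ALLOCATE(T(LIM+2, JM+2))
T = 0.0   ! give the halo rows a defined value before exchanging them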

I can't see the whole program, but here are some observations on what I can see:
1) Make sure that rnk, siz, and prv are integers. Likely, prv is real (by the default implicit-typing rules), so you are passing a real where MPI expects an integer rank, the messages don't match up, and hence the deadlock.
2) I'd use MPI_SENDRECV rather than the send/recv; recv/send code section. Two sendrecvs are cleaner (2 lines of code vs. 7), guaranteed not to deadlock, and faster when you have bi-directional links (almost always true).
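For illustration, a sketch of what the exchange could look like with two MPI_SENDRECV calls, assuming T is allocated with the halo layout implied by the counts above and that prv, nxt, and stat are declared as point 1 suggests (this is my reading of the intended halo exchange, not code from the question):
! send the first interior row to prv and receive the bottom halo row from nxt
CALL MPI_SENDRECV(T(2,:),     JM+2, MPI_REAL, PRV, 10, &
                  T(LIM+2,:), JM+2, MPI_REAL, NXT, 10, &
                  MPI_COMM_WORLD, STAT, IERR)
! send the last interior row to nxt and receive the top halo row from prv
CALL MPI_SENDRECV(T(LIM+1,:), JM+2, MPI_REAL, NXT, 20, &
                  T(1,:),     JM+2, MPI_REAL, PRV, 20, &
                  MPI_COMM_WORLD, STAT, IERR)
Since prv and nxt are already MPI_PROC_NULL at the domain boundaries, these calls need no odd/even branching at all.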


Seg fault in fortran MPI_COMM_CREATE_GROUP if using a group not directly created from MPI_COMM_WORLD

I'm getting a segmentation fault that I cannot really understand in a simple code that just:
calls MPI_INIT
duplicates the global communicator via MPI_COMM_DUP
creates a group with half of the processes of the global communicator, via MPI_COMM_GROUP
finally, from this group, creates a new communicator via MPI_COMM_CREATE_GROUP
Specifically, I use this last call instead of just MPI_COMM_CREATE because it is collective only over the group of processes contained in group, while MPI_COMM_CREATE is collective over every process in COMM.
The code is the following:
program mpi_comm_create_grp
use mpi
IMPLICIT NONE
INTEGER :: mpi_size, mpi_err_code
INTEGER :: my_comm_dup, mpi_new_comm, mpi_group_world, mpi_new_group
INTEGER :: rank_index
INTEGER, DIMENSION(:), ALLOCATABLE :: rank_vec
CALL mpi_init(mpi_err_code)
CALL mpi_comm_size(mpi_comm_world, mpi_size, mpi_err_code)
!! allocate and fill the vector for the new group
allocate(rank_vec(mpi_size/2))
rank_vec(:) = (/ (rank_index, rank_index=0, mpi_size/2-1) /)
!! create the group directly from the comm_world: this way works
! CALL mpi_comm_group(mpi_comm_world, mpi_group_world, mpi_err_code)
!! duplicating the comm_world and creating the group from the dup: this way fails
CALL mpi_comm_dup(mpi_comm_world, my_comm_dup, mpi_err_code)
!! creating the group of all processes from the duplicated comm_world
CALL mpi_comm_group(my_comm_dup, mpi_group_world, mpi_err_code)
!! create a new group with just half of processes in comm_world
CALL mpi_group_incl(mpi_group_world, mpi_size/2, rank_vec,mpi_new_group, mpi_err_code)
!! create a new comm from the comm_world using the new group created
CALL mpi_comm_create_group(mpi_comm_world, mpi_new_group, 0, mpi_new_comm, mpi_err_code)
!! deallocate and finalize mpi
if(ALLOCATED(rank_vec)) DEALLOCATE(rank_vec)
CALL mpi_finalize(mpi_err_code)
end program !mpi_comm_create_grp
If, instead of duplicating the COMM_WORLD, I directly create the group from the global communicator (commented line), everything works just fine.
The parallel debugger I'm using traces the seg fault back to a call to MPI_GROUP_TRANSLATE_RANKS, but, as far as I know, MPI_COMM_DUP duplicates all the attributes of the copied communicator, rank numbering included.
I am using ifort version 18.0.5, but I also tried 17.0.4 and 19.0.2, with no better results.
Well, the thing is a little tricky, at least for me, but after some tests and help, the root of the problem was found.
In the code, the call
CALL mpi_comm_create_group(mpi_comm_world, mpi_new_group, 0, mpi_new_comm, mpi_err_code)
creates a new communicator for the previously created group mpi_new_group. However, mpi_comm_world, which is used as the first argument, is not in the same context as mpi_new_group, as explained in the MPICH reference:
MPI_COMM_DUP will create a new communicator over the same group as
comm but with a new context
So the correct call would be:
CALL mpi_comm_create_group(my_comm_dup, mpi_new_group, 0, mpi_new_comm, mpi_err_code)
That is, replacing mpi_comm_world with my_comm_dup, the communicator from which mpi_group_world was created.
I am still not sure why it works with OpenMPI, but OpenMPI is generally more tolerant of this sort of thing.
As suggested in the comments, I wrote to the OpenMPI users list, and they replied:
That is perfectly valid. The MPI processes that make up the group are all part of comm world. I would file a bug with Intel MPI.
So I tried and posted a question on the Intel forum.
It is a bug that they solved in the latest version of the library, 19.3.

Coarray deadlock in do-cycle-exit

A strange phenomenon occurs in the following Coarray code
program strange
  implicit none
  integer :: counter = 0
  logical :: co_missionAccomplished[*]
  co_missionAccomplished = .false.
  sync all
  do
    if (this_image()==1) then
      counter = counter+1
      if (counter==2) co_missionAccomplished = .true.
      sync images(*)
    else
      sync images(1)
    end if
    if (co_missionAccomplished[1]) exit
    cycle
  end do
  write(*,*) "missionAccomplished on image ", this_image()
end program strange
This program never ends; it appears that there is a deadlock for any counter threshold beyond 1 inside the loop. The code is compiled with Intel Fortran 2018 on Windows, with the following flags:
ifort /debug /Qcoarray=shared /standard-semantics /traceback /gen-interfaces /check /fpe:0 normal.f90 -o run.exe
The same code, using a DO WHILE construct, also appears to suffer from the same phenomenon:
program strange
  implicit none
  integer :: counter = 0
  logical :: co_missionAccomplished[*]
  co_missionAccomplished = .true.
  sync all
  do while(co_missionAccomplished[1])
    if (this_image()==1) then
      counter = counter+1
      if (counter==2) co_missionAccomplished = .false.
      sync images(*)
    else
      sync images(1)
    end if
  end do
  write(*,*) "missionAccomplished on image ", this_image()
end program strange
This seems too trivial to be a compiler bug, so I am probably missing something important about do-loops in parallel. Any help is appreciated.
UPDATE:
Adding a SYNC ALL statement before the CYCLE statement in the DO-CYCLE-EXIT example program above resolves the deadlock. Also, a SYNC ALL statement right after the DO WHILE statement, as the first line of the block, resolves the deadlock. So apparently all the images must be synchronized before each cycle of the loop, in either case above, to avoid a deadlock.
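For reference, a sketch of the DO-CYCLE-EXIT loop with that extra statement in place (everything outside the loop unchanged):
do
  if (this_image()==1) then
    counter = counter+1
    if (counter==2) co_missionAccomplished = .true.
    sync images(*)
  else
    sync images(1)
  end if
  if (co_missionAccomplished[1]) exit
  sync all   ! no image can start the next pass until every image has done this test
  cycle
end do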
Regarding "This seems now too trivial to be a compiler bug", you may be very surprised about how seemingly trivial things can be treated incorrectly by a compiler. Few things relating to coarrays are trivial.
Consider the following program which is related:
implicit none
integer i[*]
do i=1,1
sync all
print '(I1)', i[1]
end do
end
I get the initially surprising output
1
2
when run with two images under ifort 2018.1.
Let's look at what's going on.
In my loop, i[1] first has value 1 when the images are synchronized. However, by the time the second image accesses the value, it's been changed by the first image ending its iteration.
We solve that little problem by putting an extra synchronization statement before the end do.
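That is, a sketch of the small example with the extra synchronization before the end do:
implicit none
integer i[*]
do i=1,1
  sync all
  print '(I1)', i[1]
  sync all   ! keep image 1 from finishing the iteration (and changing i) before others read i[1]
end do
end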
How is this program related to the one in the question? It's the same lack of synchronization between testing a value on a remote image and that image updating it.
Between the synchronization and other images testing the value of co_missionAccomplished[1], the first image may dash around and update counter, then co_missionAccomplished. Some images may see the exit state in their first iteration.

Retrospectively closing a NetCDF file created with Fortran

I'm running a distributed model stripped to its bare minimum below:
integer, parameter :: &
nx = 1200,& ! Number of columns in grid
ny = 1200,& ! Number of rows in grid
nt = 6000 ! Number of timesteps
integer :: it ! Loop counter
real :: var1(nx,ny), var2(nx,ny), var3(nx,ny), etc(nx,ny)
! Create netcdf to write model output
call check( nf90_create(path="out.nc",cmode=nf90_clobber, ncid=nc_out_id) )
! Loop over time
do it = 1,nt
! Calculate a lot of variables
...
! Write some variables in out.nc at each timestep
CALL check( nf90_put_var(ncid=nc_out_id, varid=var1_varid, values=var1, &
start = (/ 1, 1, it /), count = (/ nx, ny, 1 /)) )
! Close the netcdf otherwise it is not readable:
if (it == nt) call check( nf90_close(nc_out_id) )
enddo
I'm in the development stage of the model, so it inevitably crashes at unexpected points (usually at the "Calculate a lot of variables" stage). This means that if the model crashes at timestep it = 3000, 2999 timesteps will have been written to the NetCDF output file, but I will not be able to read the file because it has not been closed. Still, the data have been written: I currently have a 2 GB out.nc file that I can't read. When I ncdump the file it shows:
netcdf out.nc {
dimensions:
x = 1400 ;
y = 1200 ;
time = UNLIMITED ; // (0 currently)
variables:
float var1 (time, y, x) ;
data:
}
My questions are: (1) Is there a way to close the file retrospectively, even outside Fortran, to be able to read the data that have already been written? (2) Alternatively, is there another way to write the file in Fortran that would make the file readable even without closing it?
When nf90_close is called, buffered output is written to disk and the file ID is relinquished so it can be reused. The problem is most likely due to buffered output not having been written to the disk when the program terminates due to a crash, meaning that only the changes you made in "define mode" are present in the file (as shown by ncdump).
You therefore need to force the data to be written to the disk more often. There are three ways of doing this (as far as I am aware).
nf90_sync - which synchronises the buffered data to disk when called. This gives you the most control over when to output data (every loop step, or every n loop steps, for example), which can allow you to optimize for speed vs robustness, but introduces more programming and checking overhead for you.
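For example, inside the time loop of the model above it could look like this (the every-10-steps interval is just an illustrative choice):
! flush buffered output to disk so the file stays readable if a later timestep crashes
if (mod(it, 10) == 0) call check( nf90_sync(nc_out_id) )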
Thanks to @RussF for this idea. Creating or opening the file using the nf90_share flag. This is the recommended approach if the netCDF file is intended to be used by multiple readers/writers simultaneously. It is essentially an automatic implementation of nf90_sync for writing data. It gives less control, but also less programming overhead. Note that:
This only applies to netCDF-3 classic or 64-bit offset files.
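For example, the create call at the top of the model could become (a sketch; only the cmode argument changes):
! nf90_share disables aggressive buffering, so data written so far remains visible to readers
call check( nf90_create(path="out.nc", cmode=ior(nf90_clobber, nf90_share), ncid=nc_out_id) )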
Finally, an option I wouldn't recommend, but am including for completeness (and I guess there may be situations where this is the best option, although none spring to mind): closing and reopening the file. I don't recommend this because it will slow down your program and adds a greater possibility of causing errors.

Getting result of a spawned function in Erlang

My objective at the moment is to write Erlang code calculating a list of N elements, where each element is the factorial of its "index" (so, for N = 10 I would like to get [1!, 2!, 3!, ..., 10!]). What's more, I would like every element to be calculated in a separate process (I know it is simply inefficient, but I am expected to implement it and compare its efficiency with other methods later).
In my code, I wanted to use one function as a "loop" over the given N, which for N, N-1, N-2... spawns a process that calculates factorial(N) and sends the result to some "collecting" function, which packs the received results into a list. I know my concept is probably overcomplicated, so hopefully the code will explain a bit more:
messageFactorial(N, listPID) ->
listPID ! factorial(N). %% send calculated factorial to "collector".
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
nProcessesFactorialList(-1) ->
ok;
nProcessesFactorialList(N) ->
spawn(pFactorial, messageFactorial, [N, listPID]), %%for each N spawn...
nProcessesFactorialList(N-1).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
listPrepare(List) -> %% "collector", for the last factorial returns
receive %% a list of factorials (1! = 1).
1 -> List;
X ->
listPrepare([X | List])
end.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
startProcessesFactorialList(N) ->
register(listPID, spawn(pFactorial, listPrepare, [[]])),
nProcessesFactorialList(N).
I guess it should work, by which I mean that listPrepare finally returns a list of factorials. But the problem is, I do not know how to get that list, how to get what it returned. As of now my code returns ok, as this is what nProcessesFactorialList returns when it finishes. I thought about sending the List of results from listPrepare to nProcessesFactorialList at the end, but then it would also need to be a registered process, from which I wouldn't know how to recover that list.
So basically, how to get the result from a registered process running listPrepare (which is my list of factorials)? If my code is not right at all, I would ask for a suggestion of how to get it better. Thanks in advance.
My way of doing this sort of task is:
-module(par_fact).
-export([calc/1]).
fact(X) -> fact(X, 1).
fact(0, R) -> R;
fact(X, R) when X > 0 -> fact(X-1, R*X).
calc(N) ->
Self = self(),
Pids = [ spawn_link(fun() -> Self ! {self(), {X, fact(X)}} end)
|| X <- lists:seq(1, N) ],
[ receive {Pid, R} -> R end || Pid <- Pids ].
and the result:
> par_fact:calc(25).
[{1,1},
{2,2},
{3,6},
{4,24},
{5,120},
{6,720},
{7,5040},
{8,40320},
{9,362880},
{10,3628800},
{11,39916800},
{12,479001600},
{13,6227020800},
{14,87178291200},
{15,1307674368000},
{16,20922789888000},
{17,355687428096000},
{18,6402373705728000},
{19,121645100408832000},
{20,2432902008176640000},
{21,51090942171709440000},
{22,1124000727777607680000},
{23,25852016738884976640000},
{24,620448401733239439360000},
{25,15511210043330985984000000}]
The first problem is that your listPrepare process doesn't do anything with the result. Try printing it at the end.
The second problem is that you don't wait for all the processes to finish, but only for the process that sends 1, which is the quickest factorial to calculate. That message will almost surely be received before the more complex ones have been calculated, and you'll end up with only a few responses.
I answered a somewhat similar question about parallel work with many processes here: Create list across many processes in Erlang. Maybe that one will help you.
I propose you this solution:
-export([launch/1,fact/2]).
launch(N) ->
launch(N,N).
% launch(Current,Total)
% when all processes are launched go to the result collect phase
launch(-1,N) -> collect(N+1);
launch(I,N) ->
% fact will be executed in a new process, so the normal way to get the answer is by message passing
% need to give the current process pid to get the answer back from the spawned process
spawn(?MODULE,fact,[I,self()]),
% loop until all processes are launched
launch(I-1,N).
% simply send the result to Pid.
fact(N,Pid) -> Pid ! {N,fact_1(N,1)}.
fact_1(I,R) when I < 2 -> R;
fact_1(I,R) -> fact_1(I-1,R*I).
% init the collect phase with an empty result list
collect(N) -> collect(N,[]).
% collect(Remaining_result_to_collect,Result_list)
collect(0,L) -> L;
% accumulate the results in L and loop until all messages are received
collect(N,L) ->
receive
R -> collect(N-1,[R|L])
end.
but a much more straightforward (single-process) solution could be:
1> F = fun(N) -> lists:foldl(fun(I,[{X,R}|Q]) -> [{I,R*I},{X,R}|Q] end, [{0,1}], lists:seq(1,N)) end.
#Fun<erl_eval.6.80484245>
2> F(6).
[{6,720},{5,120},{4,24},{3,6},{2,2},{1,1},{0,1}]
[edit]
On a multicore system with caches and a multitasking underlying OS, there is absolutely no guarantee on the order of execution, and the same goes for message sending. The only guarantee is the message queue, where you know that you will process the messages in the order they were received. So I agree with Dmitry: your stop condition is not 100% reliable.
In addition, using startProcessesFactorialList, you spawn listPrepare, which effectively collects all the factorial values (except 1!) and then simply forgets the result at the end of the process; I guess this code snippet is not exactly the one you use for testing.

mpi_waitall in mpich2 with null values in array_of_requests

I get the following error with MPICH-2.1.5 and the PGI compiler:
Fatal error in PMPI_Waitall: Invalid MPI_Request, error stack:
PMPI_Waitall(311): MPI_Waitall(count=4, req_array=0x2ca0ae0, status_array=0x2c8d220) failed
PMPI_Waitall(288): The supplied request in array element 0 was invalid (kind=0)
in the following example Fortran code for a stencil-based algorithm:
Subroutine data_exchange
! data declaration
integer request(2*neighbor),status(MPI_STATUS_SIZE,2*neighbor)
integer n(neighbor),iflag(neighbor)
integer itag(neighbor),neigh(neighbor)
! Data initialization
request = 0; n = 0; iflag = 0;
! Create data buffers to send and recv
! Define values of n,iflag,itag,neigh based on boundary values
! Isend/Irecv look like this
ir=0
do i=1,neighbor
   if(iflag(i).eq.1) then
      ir=ir+1
      call MPI_Isend(buf_send(i),n(i),MPI_REAL,neigh(i),itag(i),MPI_COMM_WORLD,request(ir),ierr)
      ir=ir+1
      call MPI_Irecv(buf_recv(i),nsize,MPI_REAL,neigh(i),MPI_ANY_TAG,MPI_COMM_WORLD,request(ir),ierr)
   endif
enddo
! Calculations
call MPI_Waitall(2*neighbor,request,status,ierr)
end subroutine
The error occurs when the array_of_requests in mpi_waitall contains a null value (request(i)=0). The null values in array_of_requests come up when the conditional iflag(i)==1 is not satisfied. The straightforward solution is to comment out the conditional, but that would introduce the overhead of sending and receiving zero-sized messages, which is not feasible for large-scale systems (1000s of cores).
As per the MPI-forum link, the array_of_requests list may contain null or inactive handles.
I have tried the following:
not initializing array_of_requests,
resizing array_of_request to match the MPI_isend + MPI_irecv count,
assigning dummy values to array_of_request
I also tested the very same code with MPICH-1 as well as OpenMPI 1.4, and the code works without any issue.
Any insights would be really appreciated!
You could just move the first increment of ir into the conditional as well. Then you would have all handles in request(1:ir) at the end of the loop and could issue:
call MPI_Waitall(ir,request(1:ir),status(:,1:ir),ierr)
This would make sure that every request passed to MPI_Waitall holds a properly initialized handle.
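Put together, the exchange and the wait might look like this (a sketch using the same variables as in the question):
ir = 0
do i = 1, neighbor
   if (iflag(i) .eq. 1) then
      ir = ir+1
      call MPI_Isend(buf_send(i), n(i), MPI_REAL, neigh(i), itag(i), &
                     MPI_COMM_WORLD, request(ir), ierr)
      ir = ir+1
      call MPI_Irecv(buf_recv(i), nsize, MPI_REAL, neigh(i), MPI_ANY_TAG, &
                     MPI_COMM_WORLD, request(ir), ierr)
   endif
enddo
! wait only on the ir requests that were actually started
call MPI_Waitall(ir, request(1:ir), status(:,1:ir), ierr)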
Another thing: does n(i) in MPI_Isend hold the same value as nsize in the corresponding MPI_Irecv?
EDIT:
After consulting the MPI Standard (3.0, Ch. 3.7.3), I think you need to initialize the request array to MPI_REQUEST_NULL if you want to give the whole request array to MPI_Waitall.
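A minimal sketch of that initialization, replacing the request = 0 in the data-initialization section:
! MPI_Waitall skips null handles, unlike the plain integer 0 used above
request(:) = MPI_REQUEST_NULL
With that, passing the full array of 2*neighbor requests to MPI_Waitall is valid, because the unused slots hold null handles that are simply ignored.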