SIGBUS occurs when Fortran code reads a file on a Linux cluster

When I run my Fortran code in parallel on a Linux cluster with mpirun, I get a SIGBUS error.
It occurs while reading a file; the timing is irregular, and sometimes the run finishes without error.
I have tried debug compile options such as -g, but I get no information about which line the error comes from.
The code previously ran on three different clusters without this error; it only occurs on this machine.
I personally suspect this is related to the performance of the machine (especially its storage I/O), but I am not sure.
The program code is simple. Each process started by mpirun reads the file corresponding to its rank, as follows.
!!!!!!!!!! start of code
OPEN(11, FILE='FILE_NAME_WITH_RANK', FORM='UNFORMATTED')
READ(11) ISIZE
ALLOCATE(SOME_VARIABLE(ISIZE))
DO I = 1, ISIZE
   READ(11) SOME_VARIABLE(I)
ENDDO
READ(11) ISIZE2
ALLOCATE(SOME_VARIABLE2(ISIZE2))
DO I = 1, ISIZE2
   READ(11) SOME_VARIABLE2(I)
ENDDO
! MORE VARIABLES
CLOSE(11)
!!!!!!!!!! end of code
I use 191 CPUs (one MPI process each), and the total size of the 191 files they load is about 11 GB.
The cluster used for execution consists of 24 nodes with 16 CPUs each (384 CPUs in total) and uses common storage that is shared with another cluster.
I ran the code in parallel with nodes 1 through 12 listed in the hostfile.
Initially, I had all 191 processes read their files at the same time, in no particular order.
Doing so, the program ended with a SIGBUS error. In addition, the SSH connection to some nodes was delayed, and those nodes could not find the .bashrc file, reporting a stale file handle error.
The stale file handle error seemed to clear up by itself after a while, but I am not sure whether the system administrator did something.
So I changed the code as follows, so that only one process reads its file at a time.
!!!!!!!!!! start of code
DO ICPU = 0, NUMBER_OF_PROCESS-1
   IF(ICPU.EQ.MY_PROCESS) CALL READ_FILE
   CALL MPI_BARRIER(MPI_COMMUNICATOR,IERR)
ENDDO
!!!!!!!!!! end of code
This seemed to work fine for a single run, but if I started more than one of these programs at the same time, the first mpirun stalled and eventually both ended with a SIGBUS error.
My next attempt is to minimize the number of READ statements by reading each array with a single statement instead of element by element in a DO loop (see the sketch below). However, due to limited time, I have not yet been able to test whether this modification helps.
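A minimal sketch of what I have in mind (untested; it assumes the writer also writes each array as a single record, since sequential unformatted I/O is record-based):
!!!!!!!!!! start of code
OPEN(11, FILE='FILE_NAME_WITH_RANK', FORM='UNFORMATTED')
READ(11) ISIZE
ALLOCATE(SOME_VARIABLE(ISIZE))
READ(11) SOME_VARIABLE        ! one READ for the whole array instead of ISIZE separate READs
READ(11) ISIZE2
ALLOCATE(SOME_VARIABLE2(ISIZE2))
READ(11) SOME_VARIABLE2
! MORE VARIABLES
CLOSE(11)
!!!!!!!!!! end of code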
Here is some additional information.
If I run a search or copy a file in a file manager such as Nautilus while the parallel program is running, Nautilus stops responding or the running program raises SIGBUS. In severe cases, I could not even connect to the VNC server because of stale file handle errors.
I use Open MPI 2.1.1 and GNU Fortran 4.9.4.
I compile the program with the following:
$OPENMPIHOME/bin/mpif90 -mcmodel=large -fmax-stack-var-size=64 -cpp -O3 $SOURCE -o $EXE
I execute the program with the following in a GNOME terminal:
$OPENMPIHOME/bin/mpirun -np $NP -x $LD_LIBRARY_PATH --hostfile $HOSTFILE $EXE
The cluster is said to be running commercial software like FLUENT without problems.
Summing up the above, my personal suspicion is that the cluster's storage is being unmounted because of the excessive disk I/O generated by my code, but I don't know whether that makes sense, since I have little knowledge of clusters.
If so, I wonder whether there is a way to minimize the disk I/O, whether the whole-array I/O mentioned above is enough, or whether something more is needed.
I would appreciate it if you could tell me anything about the problem.
Thanks in advance.
!!!
I wrote an example code. As mentioned above, it may not be easy to reproduce because the occurrence depends on the machine.
PROGRAM BUSWRITE
  IMPLICIT NONE
  INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
  DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
  INTEGER :: I
  INTEGER :: I1, I2, I3
  CHARACTER*3 CPUNUM
  INCLUDE 'mpif.h'
  INTEGER ISTATUS(MPI_STATUS_SIZE)
  INTEGER :: IERR, NPES, MYPE

  CALL MPI_INIT(IERR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)

  ! Build a three-digit file name from the rank, e.g. rank 5 -> '005.DAT'
  I1=MOD(MYPE/100,10)+48
  I2=MOD(MYPE/10 ,10)+48
  I3=MOD(MYPE    ,10)+48
  CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)

  ! Each rank writes its own unformatted file, one record per element
  OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
  ALLOCATE(ARRAY1(ISIZE1))
  ALLOCATE(ARRAY2(ISIZE2))
  ALLOCATE(ARRAY3(ISIZE3))
  DO I = 1, ISIZE1
    ARRAY1(I) = I
    WRITE(11) ARRAY1(I)
  ENDDO
  DO I = 1, ISIZE2
    ARRAY2(I) = I**2
    WRITE(11) ARRAY2(I)
  ENDDO
  DO I = 1, ISIZE3
    ARRAY3(I) = DBLE(I)**3   ! DBLE avoids 32-bit integer overflow of I**3
    WRITE(11) ARRAY3(I)
  ENDDO
  CLOSE(11)
  CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./buswrite.f90 -o ./buswrite
mpirun -np 32 ./buswrite
This produces 32 files, 000.DAT through 031.DAT.
PROGRAM BUSREAD
  IMPLICIT NONE
  INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
  DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
  INTEGER :: I
  INTEGER :: I1, I2, I3
  CHARACTER*3 CPUNUM
  INCLUDE 'mpif.h'
  INTEGER ISTATUS(MPI_STATUS_SIZE)
  INTEGER :: IERR, NPES, MYPE

  CALL MPI_INIT(IERR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)

  ! Build the same three-digit file name from the rank as BUSWRITE
  I1=MOD(MYPE/100,10)+48
  I2=MOD(MYPE/10 ,10)+48
  I3=MOD(MYPE    ,10)+48
  CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)

  ! Each rank reads back its own file and checks the values
  OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
  ALLOCATE(ARRAY1(ISIZE1))
  ALLOCATE(ARRAY2(ISIZE2))
  ALLOCATE(ARRAY3(ISIZE3))
  DO I = 1, ISIZE1
    READ(11) ARRAY1(I)
    IF(ARRAY1(I).NE.I) STOP
  ENDDO
  DO I = 1, ISIZE2
    READ(11) ARRAY2(I)
    IF(ARRAY2(I).NE.I**2) STOP
  ENDDO
  DO I = 1, ISIZE3
    READ(11) ARRAY3(I)
    IF(ARRAY3(I).NE.DBLE(I)**3) STOP   ! DBLE matches the value written by BUSWRITE
  ENDDO
  CLOSE(11)
  CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
  IF(MYPE.EQ.0) WRITE(*,*) 'GOOD'
  CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./busread.f90 -o ./busread
mpirun -np 32 ./busread
I get the expected 'GOOD' output in the terminal, but on the machine in question busread terminates with a SIGBUS error.

The issue was not observed after the machine was rebooted. Even when I ran four programs at the same time under the same conditions, no problem occurred. In addition, other teams using the machine had similar problems, which were also resolved by the reboot. The conclusion is a bit anticlimactic, but in case anyone experiences similar problems, I would summarize it as follows.
If your program terminates abnormally with a memory error (such as SIGBUS or SIGSEGV) while reading or writing a file, check the following.
Make sure there are no errors in your program. Check whether the error occurs at a fixed point or irregularly, whether other programs show the same symptoms, whether the code runs well on other machines, and whether a memory-error checking tool such as valgrind reports a problem.
Optimize the file I/O. In Fortran, reading or writing an entire array in one statement is tens of times faster than processing it element by element.
Immediately after an error occurs, try to SSH into the machine (or node) to check whether the connection is responsive and the file system is accessible. If .bashrc cannot be read or errors such as stale file handle appear, collect the information from the checks above and contact the system administrator.
If someone has anything to add or if this post isn't appropriate, please let me know.

Related

Run the same code with different parameterizations on multiple nodes of a slurm cluster

I have a Fortran code with 3 scenarios.
I set a flag at the beginning of the code for the scenario I want to run.
integer :: scenario_no = 1 !Set 1, 2 or 3
I usually change this flag manually, compile the code, and run it on a cluster node.
Is there any way to create an sbatch file that runs each of the 3 scenarios on a different node without having to recompile each time?
I recommend reading in the command line arguments to the program.
integer :: clen, status, scenario_no
character(len=4) :: buffer

! Body of runcfdsim
scenario_no = 1
clen = command_argument_count()
if(clen>0) then
   call get_command_argument(1, buffer, clen, status)
   if(status==0) then
      read(buffer, '(BN,I4)') scenario_no
      if(scenario_no<1 .or. scenario_no>4) then
         scenario_no = 1
      end if
   end if
end if
This way you can call
runcfdsim 1
runcfdsim 2
runcfdsim 3
runcfdsim 4
and each run will have a different value for the scenario_no variable.
No, if the variable is hard-coded, you have to recompile every time you change it.
EDIT: As pointed out in the comments, reading from a file may lead to problems when running in parallel. Better to read from a command-line argument:
program test
   implicit none
   character(len=16) :: arg
   integer :: ios, scenario
   if(command_argument_count() >= 1) then
      call get_command_argument(1, arg)
      read(unit=arg, fmt=*, iostat=ios) scenario
      if(ios /= 0) then
         write(*,"('Invalid argument: ',a)") trim(arg)
         stop
      end if
      write(*,"('Running for scenario: ',i0)") scenario
   else
      write(*,"('Missing argument')")
   end if
end program test
srun test.exe 1 &
srun test.exe 2 &
srun test.exe 3 &
wait
Make sure the outputs go to different files so they are not overwritten (see the sketch below). You may also need to pin the tasks to different cores.
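For example, one simple way to keep the outputs separate is to derive the output file name from scenario_no. The program name and the naming pattern below are only an illustration:
program write_per_scenario
   implicit none
   character(len=32) :: outfile
   character(len=16) :: arg
   integer :: scenario_no, ios

   ! Read the scenario number from the first command-line argument
   scenario_no = 1
   if (command_argument_count() >= 1) then
      call get_command_argument(1, arg)
      read(arg, *, iostat=ios) scenario_no
      if (ios /= 0) scenario_no = 1
   end if

   ! e.g. result_scenario_2.dat when started as "write_per_scenario 2"
   write(outfile, '(A,I0,A)') 'result_scenario_', scenario_no, '.dat'
   open(unit=20, file=trim(outfile), status='replace')
   write(20,*) 'results for scenario ', scenario_no
   close(20)
end program write_per_scenario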

Coarray deadlock in do-cycle-exit

A strange phenomenon occurs in the following Coarray code
program strange
   implicit none
   integer :: counter = 0
   logical :: co_missionAccomplished[*]

   co_missionAccomplished = .false.
   sync all
   do
      if (this_image()==1) then
         counter = counter+1
         if (counter==2) co_missionAccomplished = .true.
         sync images(*)
      else
         sync images(1)
      end if
      if (co_missionAccomplished[1]) exit
      cycle
   end do
   write(*,*) "missionAccomplished on image ", this_image()
end program strange
This program never ends; there appears to be a deadlock for any counter threshold beyond 1 inside the loop. The code is compiled with Intel Fortran 2018 on Windows, with the following flags:
ifort /debug /Qcoarray=shared /standard-semantics /traceback /gen-interfaces /check /fpe:0 normal.f90 -o run.exe
The same code, rewritten with a DO WHILE construct, appears to suffer from the same problem:
program strange
   implicit none
   integer :: counter = 0
   logical :: co_missionAccomplished[*]

   co_missionAccomplished = .true.
   sync all
   do while(co_missionAccomplished[1])
      if (this_image()==1) then
         counter = counter+1
         if (counter==2) co_missionAccomplished = .false.
         sync images(*)
      else
         sync images(1)
      end if
   end do
   write(*,*) "missionAccomplished on image ", this_image()
end program strange
This seems too trivial to be a compiler bug, so I am probably missing something important about DO loops in parallel. Any help is appreciated.
UPDATE:
Adding a SYNC ALL statement before the CYCLE statement in the DO-CYCLE-EXIT example above resolves the deadlock. Likewise, a SYNC ALL statement right after the DO WHILE statement, as the first line of the loop body, resolves the deadlock. So apparently all images must synchronize before each cycle of the loop in either case, as in the sketch below.
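For clarity, this is the DO-CYCLE-EXIT example with the SYNC ALL added before the CYCLE statement:
program strange_fixed
   implicit none
   integer :: counter = 0
   logical :: co_missionAccomplished[*]

   co_missionAccomplished = .false.
   sync all
   do
      if (this_image()==1) then
         counter = counter+1
         if (counter==2) co_missionAccomplished = .true.
         sync images(*)
      else
         sync images(1)
      end if
      if (co_missionAccomplished[1]) exit
      sync all   ! all images synchronize before starting the next iteration
      cycle
   end do
   write(*,*) "missionAccomplished on image ", this_image()
end program strange_fixed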
Regarding "This seems now too trivial to be a compiler bug", you may be very surprised about how seemingly trivial things can be treated incorrectly by a compiler. Few things relating to coarrays are trivial.
Consider the following program which is related:
implicit none
integer i[*]
do i=1,1
   sync all
   print '(I1)', i[1]
end do
end
I get the initially surprising output
1
2
when run with two images under ifort 2018.1.
Let's look at what's going on.
In my loop, i[1] first has value 1 when the images are synchronized. However, by the time the second image accesses the value, it's been changed by the first image ending its iteration.
We solve that little problem by putting an extra synchronization statement before the end do, as shown below.
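That is, with the extra synchronization the small example becomes:
implicit none
integer i[*]
do i=1,1
   sync all
   print '(I1)', i[1]
   sync all   ! no image leaves the iteration (and so updates i) before every image has printed
end do
end
Now no image can reach the end of the iteration, where i is updated, before every image has read i[1], so both images print 1.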
How is this program related to the one of the question? It's the same lack of synchronization between testing a value on a remote image and that image updating it.
Between the synchronization and other images testing the value of co_missionAccomplished[1], the first image may dash around and update counter, then co_missionAccomplished. Some images may see the exit state in their first iteration.

OpenACC error when running programs of higher magnitude

Is the following code correct? I have a 2 GB GeForce 750M and use the PGI Fortran compiler. The program works fine for 4000x4000 arrays; for anything larger it complains even though it should not. As you can see, I have declared 9000x9000 arrays, but if I use a value of n > 4000 the program throws a runtime error.
program matrix_multiply
   !use openacc
   implicit none
   integer :: i,j,k,n
   real, dimension(9000,9000) :: a, b, c
   real x_scalar
   real x_vector(2)
   n=5000
   call random_number (b)
   call random_number (a)
   !$acc kernels
   do k = 1,n
      do i = 1,n
         do j = 1,n
            c(i,k) = c(i,k) + a(i,j) * b(j,k)
         enddo
      enddo
   enddo
   !$acc end kernels
end program matrix_multiply
Thanks to Robert Crovella
My guess is that there is some sort of display timeout on the Mac. As you increase the size, the matrix multiply kernel takes longer, and at some point the display driver timeout in Mac OS resets the GPU. If that is the case, you could work around it by switching to a system/GPU where the GPU is not hosting a display. Both Linux and Windows (TDR) have similar timeout mechanisms.
You have to boot into >console mode in Mac OS and also disable automatic graphics switching, since console mode turns off Aqua (the Mac GUI) and is thus supposed to remove the limitation.

Retrieve data from file written in FORTRAN during program run

I am trying to write a series of time values (reals) to a .dat file in Fortran. This is part of an MPI code that runs for a long time, so I would like to write data at every time step and be able to read the file at any point during the execution. The problem I am facing is that the time values are not written to the file until the program ends. I have put the OPEN statement before the DO loop and the CLOSE statement after the end of the DO loop.
The parts of my code look like:
open(unit=57,file='inst.dat')
do loop starts
.
.
.
write(57,*) time
.
.
.
end do
close(57)
Try CALL FLUSH(unit). Check your compiler docs, as I think this is an extension.
You mention MPI: for parallel codes I think you need to give each process its own file/unit,
or take other measures to avoid conflicts.
From Gfortran manual:
Beginning with the Fortran 2003 standard, there is a FLUSH statement that should be preferred over the FLUSH intrinsic.
The FLUSH intrinsic and the Fortran 2003 FLUSH statement have identical effect: they flush the runtime library's I/O buffer so that the data becomes visible to other processes. This does not guarantee that the data is committed to disk.
On POSIX systems, you can request that all data is transferred to the storage device by calling the fsync function, with the POSIX file descriptor of the I/O unit as argument (retrieved with GNU intrinsic FNUM). The following example shows how:
! Declare the interface for POSIX fsync function
interface
function fsync (fd) bind(c,name="fsync")
use iso_c_binding, only: c_int
integer(c_int), value :: fd
integer(c_int) :: fsync
end function fsync
end interface
! Variable declaration
integer :: ret
! Opening unit 10
open (10,file="foo")
! ...
! Perform I/O on unit 10
! ...
! Flush and sync
flush(10)
ret = fsync(fnum(10))
! Handle possible error
if (ret /= 0) stop "Error calling FSYNC"
How about closing the file after every time step (assuming a reasonable amount of time elapses between time steps)?
do loop starts
.
.
!Note: an if statement should wrap the following so that it is
!only called by one processor.
open(unit=57,file='inst.dat')
write(57,*) time
close(57)
.
.
end do
Alternatively, if the time between steps is short, writing the data in blocks of 10, 100, ... iterations may be more efficient; a sketch follows.
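A minimal sketch of that block-wise approach (the block size NFLUSH and the loop body are illustrative; unit 57 and inst.dat are from the question):
program periodic_write
   implicit none
   integer, parameter :: nsteps = 1000, nflush = 100
   double precision :: buffer(nflush), time
   integer :: istep, nbuf

   nbuf = 0
   time = 0.0d0
   do istep = 1, nsteps
      time = time + 0.01d0          ! stand-in for the real time-step update
      nbuf = nbuf + 1
      buffer(nbuf) = time
      if (nbuf == nflush) then
         ! write a block of values, then close so the data reaches the file
         open(unit=57, file='inst.dat', position='append', status='unknown')
         write(57,*) buffer(1:nbuf)
         close(57)
         nbuf = 0
      end if
   end do

   ! write any values left in the buffer
   if (nbuf > 0) then
      open(unit=57, file='inst.dat', position='append', status='unknown')
      write(57,*) buffer(1:nbuf)
      close(57)
   end if
end program periodic_write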

mpirun is not working with two nodes

I am working on a cluster where each node has 16 processors. My version of Open MPI is 1.5.3. I have written the following simple code in Fortran:
      program MAIN
      implicit none
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      integer ierr,my_rank,size
      integer irep, nrep, iex
      character*1 task

!     Initialize MPI
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,my_rank,ierr)
      call mpi_comm_size(MPI_COMM_WORLD,size,ierr)

      do iex=1,2
        if(my_rank.eq.0) then
!         Task for the master
          nrep = size
          do irep=1,nrep-1
            task='q'
            print *, 'master',iex,task
            call mpi_send(task,1,MPI_BYTE,irep,irep+1,
     &                    MPI_COMM_WORLD,ierr)
          enddo
        else
!         Here are the tasks for the slaves:
!         receive the task sent by the master node
          call mpi_recv(task,1,MPI_BYTE,0,my_rank+1,
     &                  MPI_COMM_WORLD,status,ierr)
          print *, 'slaves', my_rank,task
        endif
      enddo

      call mpi_finalize(ierr)
      end
then I compile the code with:
/usr/lib64/openmpi/bin/mpif77 -o test2 test2.f
and run it with
/usr/lib64/openmpi/bin/mpirun -np 32 -hostfile nodefile test2
my nodefile looks like this:
node1
node1
...
node2
node2
...
with node1 and node2 repeated 16 times each.
I can compile successfully. When I run it with -np 16 (so on just one node) it works fine: each slave finishes its task and I get the prompt back in the terminal. But when I try -np 32, not all of the slaves finish their work; only 16 of them do.
Actually, with 32 processes the program does not give me the prompt back, so I think it is stuck somewhere, waiting for some task to be performed.
I would appreciate any comments, as I have already spent some time on this seemingly trivial problem.
Thanks.
I'm not sure that your nodefile is correct. I'd expect to see lines like this:
node1 slots=16
Open MPI is pretty well documented; have you checked out their FAQ?
Did you try mpiexec instead of mpirun?