I have a Fortran code that produces an unformatted file (.checkpoint). In the same code, I am trying to assign a value to a variable if the unformatted file exists, before the code creates it later. Here is part of the code:
integer :: nt, irec1
logical OK
nt = 1
a = 0
inquire (file='.checkpoint', exist=OK)
if (OK) then
a = 1
end if
if(a .eq. 1) then
inquire (iolength = irec1) nt
open(unit=255, file='.checkpoint', status='unknown', form='unformatted', access='direct', recl=irec1)
read(255,rec=1) nt
go to 315
end if
!do something here
315 nt = 2
!Saving checkpoint file
inquire (iolength = irec1) nt
open(unit=255, file='.checkpoint', status='unknown', form='unformatted', access='direct', recl=irec1)
write(255,rec=1) nt
stop
end
For the first run, OK should be false, and a = 0, since there is no .checkpoint file. If run the code using ifort compiler, I obtain a error message:
forrtl: severe (36): attempt to access non-existent record, unit 255 ~/directory/.checkpoint
My question is how to first inquire if an unformatted file exists. If exists then assign a = 1. The above code works fine if the file is formatted. The main aim of this code is to restart calculation for molecular dynamics simulation.
Related
This question already has an answer here:
How to flush stdout in Fortran 90?
(1 answer)
Closed 10 months ago.
When I write the following .dat file, the file is only 6397 lines long even though the nested do loops correctly iterate 6400 times. Also, when I output the data NX(BR,BC) to Terminal, everything is fine. Only the .dat file is missing a few lines. Any idea what the issue might be?
Note NX is defined as REAL(8),DIMENSION(0:2000,0:2000) :: NX
Here's the rest of the relevant part of the code:
OPEN(UNIT=18,FILE='test_file.dat',STATUS='UNKNOWN',ACTION='WRITE')
I = 0 ! for debugging
DO BC = 0,39
DO BR = 0,159
! do some calculations here to calculate NX(BR,BC)
! for debugging
I = I + 1
WRITE(18,*) NX(BR,BC)
ENDDO
ENDDO
WRITE(*,*) I ! outputs 6400
I tested your code with my raspberry pi 4 (GNU Fortran (GCC) 11.2.0) and got exactly 6400 lines of zeros.
Maybe the write buffer is not flushed correctly. Have you tried closing and re-opening the file: How to carriage return and flush in Fortran?
close(18)
Here is a snippet of simple code that reads line from file, then returns to previous position and re-reads same line:
program main
implicit none
integer :: unit, pos, stat
character(128) :: buffer
! Open file as formatted stream
open( NEWUNIT=unit, FILE="data.txt", ACCESS="stream", FORM="formatted", STATUS="old", ACTION="read", IOSTAT=stat )
if ( stat /= 0 ) error stop
! Skip 2 lines
read (unit,*) buffer
read (unit,*) buffer
! Store position
pos = ftell(unit)
! Read & write next line
read (unit,*) buffer
write (*,*) "buffer=", trim(buffer)
! Return to previous position
call fseek(unit,pos,0)
! pos = ftell(unit) ! <-- ?!
! Read & write next line (should be same output)
read (unit,*) buffer
write (*,*) "buffer=", trim(buffer)
! Close file stream
close (UNIT=unit)
end program main
The "data.txt" is just a dummy file with 4 lines:
1
2
3
4
Now when I compile the snippet (gfortran 9.3.0) and run it, I get an answer:
buffer=3
buffer=4
which is wrong, as they should be same. More interestingly when I add an additional ftell (commented line in the snippet) after 'fseek' I get correct answer:
buffer=3
buffer=3
Any idea why it does that? or am I using ftell and fseek incorrectly?
gfortran's documentation for FTELL and FSEEK clearly states that these routines are provided for backwards compatibility with g77. As your code is using NEWUNIT, ERROR STOP, and STREAM access, you are not compiling old moldy code. You ought to use standard conforming methods as pointed out by #Vladimir.
A quick debugging session shows that FTELL and FSEEK are using a 0-based reference for the file position while the inquire method of modern Fortran is 1 based. There could be an off-by-one type bug in gfortran, but as FTELL and FSEEK are for backwards compatibility with g77 (an unmaintained 15+ year old compiler), someone would need to do some code spelunking to determine the intended behavior. I suspect none of the current, active, gfortran developers care enough to explore the problem. So, to fix your problem
program main
implicit none
integer pos, stat, unit
character(128) buffer
! Open file as formatted stream
open(NEWUNIT=unit, FILE="data.txt", ACCESS="stream", FORM="formatted", &
& STATUS="old", ACTION="read", IOSTAT=stat)
if (stat /= 0) stop
! Skip 2 lines
read (unit,*) buffer
read (unit,*) buffer
! Store position
inquire(unit, pos=pos)
! Read & write next line
read (unit,*) buffer
write (*,*) "buffer=", trim(buffer)
! Reread & write line (should be same output)
read (unit,*,pos=pos) buffer
write (*,*) "buffer=", trim(buffer)
! Close file stream
close (UNIT=unit)
end program main
When I run my fortran code in parallel on a linux cluster with mpirun I get a sigbus error.
It occurs while reading a file, the timing is irregular, and sometimes it proceeds without error.
I have tried debug compilation options like -g, but I haven't gotten any information on what line the error is coming from.
Actually the code was executed previously in three different clusters without this error, but the error is only occurring on this machine.
I personally suspect this is related to the performance of the machine (especially storage i/o), but I am not sure.
The program code is simple. Each process executed by mpirun reads the file corresponding to its rank as follows.
!!!!!!!!!! start of code
OPEN(11, FILE='FILE_NAME_WITH_RANK', FORM='UNFORMATTED')
READ(11,*) ISIZE
ALLOCATE(SOME_VARIABLE(ISIZE))
DO I = 1, ISIZE
READ(11,*) SOME_VARIABLE(I)
ENDDO
READ(11,*) ISIZE2
ALLOCATE(SOME_VARIABLE2(ISIZE2))
DO I = 1, ISIZE2
READ(11,*) SOME_VARIABLE2(I)
ENDDO
! MORE VARIABLES
CLOSE(11)
!!!!!!!!!! end of code
I used 191 cpu, and the total size of 191 files it loads is about 11 GB.
The cluster used for execution consists of 24 nodes with 16 cpu each (384 cpu total) and uses common storage that is shared with another cluster.
I ran the code in parallel by specifying nodes 1 through 12 as the hostfile.
Initially, I had 191 cpu read all files at the same time out of sequence.
After doing so, the program ended with a sigbus error. Also, for some nodes, the ssh connection was delayed, and the bashrc file cannot be found by node with stale file handle error.
The stale file handle error waited a bit and it seemed to recover by itself, but I'm not sure what the system administrator did.
So, I changed it to the following code so that only one cpu can read the file at a time.
!!!!!!!!!! start of code
DO ICPU = 0, NUMBER_OF_PROCESS-1
IF(ICPU.EQ.MY_PROCESS) CALL READ_FILE
CALL MPI_BARRIER(MPI_COMMUNICATOR,IERR)
ENDDO
!!!!!!!!!! end of code
This seemed to work fine for single execution, but if I ran more than one of these programs at the same time, the first mpirun stopped and both ended with a sigbus error eventually.
My next attempt is to minimize the execution of the read statement by deleting the do statement when reading the array. However, due to limited time, I couldn't test the effectiveness of this modification.
Here are some additional information.
If I execute a search or copy a file with an explorer such as nautilus while running a parallel program, nautilus does not respond or the running program raise sigbus. In severe cases, I wasn't able to connect the VNC server with stale file handle errors.
I use OpenMPI 2.1.1, GNU Fortran 4.9.4.
I compile the program with following
$OPENMPIHOME/bin/mpif90 -mcmodel=large -fmax-stack-var-size-64 -cpp -O3 $SOURCE -o $EXE
I execute the program with following in gnome terminal
$OPENMPIHOME/bin/mpirun -np $NP -x $LD_LIBRARY_PATH --hostfile $HOSTFILE $EXE
The cluster is said to be running commercial software like FLUENT without problems.
Summing up the above, my personal suspicion is that the storage of the cluster is dismounted due to the excessive disk I/O generated by my code, but I don't know if this makes sense because I have no cluster knowledge.
If yes, I wonder if there is a way to minimize the disk I/O, if it is enough to proceed with the vectorized I/O mentioned above, or if there is an additional part.
I would appreciate it if you could tell me anything about the problem.
Thanks in advance.
!!!
I wrote an example code. As mentioned above, it may not be easy to reproduce because the occurrence varies depending on the machine.
PROGRAM BUSWRITE
IMPLICIT NONE
INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
INTEGER :: I
INTEGER :: I1, I2, I3
CHARACTER*3 CPUNUM
INCLUDE 'mpif.h'
INTEGER ISTATUS(MPI_STATUS_SIZE)
INTEGER :: IERR, NPES, MYPE
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)
I1=MOD(MYPE/100,10)+48
I2=MOD(MYPE/10 ,10)+48
I3=MOD(MYPE ,10)+48
CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)
OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
ALLOCATE(ARRAY1(ISIZE1))
ALLOCATE(ARRAY2(ISIZE2))
ALLOCATE(ARRAY3(ISIZE3))
DO I = 1, ISIZE1
ARRAY1(I) = I
WRITE(11) ARRAY1(I)
ENDDO
DO I = 1, ISIZE2
ARRAY2(I) = I**2
WRITE(11) ARRAY2(I)
ENDDO
DO I = 1, ISIZE3
ARRAY3(I) = I**3
WRITE(11) ARRAY3(I)
ENDDO
CLOSE(11)
CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./buswrite.f90 -o ./buswrite
mpirun -np 32 ./buswrite
I've got 32 000.DAT ~ 031.DAT
PROGRAM BUSREAD
IMPLICIT NONE
INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
INTEGER :: I
INTEGER :: I1, I2, I3
CHARACTER*3 CPUNUM
INCLUDE 'mpif.h'
INTEGER ISTATUS(MPI_STATUS_SIZE)
INTEGER :: IERR, NPES, MYPE
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)
I1=MOD(MYPE/100,10)+48
I2=MOD(MYPE/10 ,10)+48
I3=MOD(MYPE ,10)+48
CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)
OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
ALLOCATE(ARRAY1(ISIZE1))
ALLOCATE(ARRAY2(ISIZE2))
ALLOCATE(ARRAY3(ISIZE3))
DO I = 1, ISIZE1
READ(11) ARRAY1(I)
IF(ARRAY1(I).NE.I) STOP
ENDDO
DO I = 1, ISIZE2
READ(11) ARRAY2(I)
IF(ARRAY2(I).NE.I**2) STOP
ENDDO
DO I = 1, ISIZE3
READ(11) ARRAY3(I)
IF(ARRAY3(I).NE.I**3) STOP
ENDDO
CLOSE(11)
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
IF(MYPE.EQ.0) WRITE(*,*) 'GOOD'
CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./busread.f90 -o ./busread
mpirun -np 32 ./busread
I've got 'GOOD' output text from terminal as expected, but the machine in question is terminated with a sigbus error while running busread.
The issue was not observed after a device reboot. Even though I ran 4 programs at the same time under the same conditions, no problem occurred. In addition, other teams that used the device also had similar problems, which were resolved after reboot. The conclusion is a bit ridiculous, but if there are any people experiencing similar problems, I would like to summarize it as follows.
If your program terminates abnormally due to a memory error (like sigbus and sigsegv) while reading or writing a file, you can check the following.
Make sure there are no errors in your program. Check whether the time of occurrence of the error is constant or irregular, whether other programs have the same symptoms, whether it runs well on other machines, and whether there is a problem when run with a memory error checking tool such as valgrind.
Optimize the file I/O part. In the case of fortran, processing an entire array is tens of times faster than processing by element.
Immediately after an error occurs, try ssh connection to the machine (or node) to check whether the connection is smooth and that the file system is well accessed. If you cannot access the bashrc file or an error such as stale file handle occurs, please contact the system manager after combining the above reviewed information.
If someone has anything to add or if this post isn't appropriate, please let me know.
I'm running a distributed model stripped to its bare minimum below:
integer, parameter :: &
nx = 1200,& ! Number of columns in grid
ny = 1200,& ! Number of rows in grid
nt = 6000 ! Number of timesteps
integer :: it ! Loop counter
real :: var1(nx,ny), var2(nx,ny), var3(nx,ny), etc(nx,ny)
! Create netcdf to write model output
call check( nf90_create(path="out.nc",cmode=nf90_clobber, ncid=nc_out_id) )
! Loop over time
do it = 1,nt
! Calculate a lot of variables
...
! Write some variables in out.nc at each timestep
CALL check( nf90_put_var(ncid=nc_out_id, varid=var1_varid, values=var1, &
start = (/ 1, 1, it /), count = (/ nx, ny, 1 /)) )
! Close the netcdf otherwise it is not readable:
if (it == nt) call check( nf90_close(nc_out_id) )
enddo
I'm in the development stage of the model so, it inevitably crashes at unexpected points (usually at the Calculate a lot of variables stage), which means that, if the model crashes at timestep it =3000, 2999 timesteps will be written to the netcdf output file, but I will not be able to read the file because the file has not been closed. Still, the data have been written: I currently have a 2GB out.nc file that I can't read. When I ncdump the file it shows
netcdf out.nc {
dimensions:
x = 1400 ;
y = 1200 ;
time = UNLIMITED ; // (0 currently)
variables:
float var1 (time, y, x) ;
data:
}
My questions are: (1) Is there a way to close the file retrospectively, even outside Fortran, to be able to read the data that have already been written? (2) Alternatively, is there another way to write the file in Fortran that would make the file readable even without closing it?
When nf90_close is called, buffered output is written to disk and the file ID is relinquished so it can be reused. The problem is most likely due to buffered output not having been written to the disk when the program terminates due to a crash, meaning that only the changes you made in "define mode" are present in the file (as shown by ncdump).
You therefore need to force the data to be written to the disk more often. There are three ways of doing this (as far as I am aware).
nf90_sync - which synchronises the buffered data to disk when called. This gives you the most control over when to output data (every loop step, or every n loop steps, for example), which can allow you to optimize for speed vs robustness, but introduces more programming and checking overhead for you.
Thanks to #RussF for this idea. Creating or opening the file using the nf90_share flag. This is the recommended approach if the netCDF file is intended to be used by multiple readers/writers simultaneously. It is essentially the same as an automatic implementation of nf90_sync for writing data. It gives less control, but also less programming overhead. Note that:
This only applies to netCDF-3 classic or 64-bit offset files.
Finally, an option I wouldn't recommend, but am including for completeness (and I guess there may be situations where this is the best option, although none spring to mind) - closing and reopening the file. I don't recommend this, because it will slow down your program, and adds greater possibility of causing errors.
There is 3 columns in the file i'm reading and I want to average each column and take the std. The code compiles now, but nothing is being printed.
Here is my code:
program cardata
implicit none
real, dimension(291) :: x
intEGER I,N
double precision date, odometer, fuel
real :: std=0
real :: xbar=0
open(unit=10, file="car.dat", FOrm="FORMATTED", STATUS="OLD", ACTION="READ")
read(10,*) N
do I=1,N
read(10,*) x(I)
xbar= xbar +x(I)
enddo
xbar = xbar/N
DO I =1,N
std =std +((x(I) -xbar))**2
enddo
std = SQRT((std / (N - 1)))
print*,'mean:',xbar
print*, 'std deviation:',std
close(unit=10)
end program cardata
I am fairly new to this, any input will be greatly appreciated.
Example of car.dat:
date odometer fuel
19930114 298 22.4
19930118 566 18.1
19930118 800 18.9
19930121 960 15.8
19930125 1247 19.8
19930128 1521 17.1
19930128 1817 19.8
19930202 2079 18.0
19930202 2342 10.0
19930209 2511 16.4
19930212 2780 16.7
19930214 3024 19.0
19930215 3320 17.7
19930302 3560 16.4
19930312 3853 18.8
19930313 4105 18.5
From the car.dat that you gave in the comments, it's surprising that the program doesn't show anything. When I run it, it shows a very clear runtime error:
$ gfortran -o cardata cardata.f90
$ ./cardata
At line 12 of file cardata.f90 (unit = 10, file = 'car.dat')
Fortran runtime error: Bad integer for item 1 in list input
You seem to be copying code from another example without really understanding what it does. The code, as you wrote it, expects the file car.dat to be in a certain format: First an integer, which corresponds to the number of items in the file, then a single real per line. So something like this:
5
1.2
4.1
2.2
0.4
-5.2
But with your example, the first line contains text (that is, the description of the different columns), and when it tries to parse that into an integer (N) it must fail.
I will not give you the complete example, as I have the nagging suspicion that this is some sort of homework from which you are supposed to learn something. But here are a few hints:
You can easily read several values per line:
read(10, *) date(I), odometer(I), fuel(I)
I'm assuming here that, different to your program, date, odometer, and fuel are arrays. date and odometer can be integer, but fuel must be real (or double precision, but that's not necessary for these values).
You need to jump over the first line before you can start. You can just read the line into a dummy character(len=64) variable. (I picked len=64, but you can pick any other length that you feel confident with, but it should be long enough to actually contain the whole line.)
The trickiest bit is how to get your N as it is not given at the beginning of the file. What you can do is this:
N = 0
readloop : do
read(10, fmt=*, iostat=ios) date(N+1), odometer(N+1), fuel(N+1)
if (ios /= 0) exit readloop
N = N + 1
end do readloop
Of course you need to declare INTEGER :: ios at the beginning of your program. This will try to read the values into the next position on the arrays, and if it fails (usually because it has reached the end of the file) it will just end.
Note that this again expects date, odometer, and fuel to be arrays, and moreover, to be arrays large enough to contain all values. If you can't guarantee that, I recommend reading up on allocatable arrays and how to dynamically increase their size.