OpenACC error when running programs of higher magnitude - fortran

Using the following code, is it correct? I have 2GB Geforce 750M and using the PGI Fortran compiler. The program works fine for 4000x4000 arrays, anything higher it complains even though it should not, You can see i have allocated a 9000x9000 array but if i use a n value > 4000 it complains and throws a runtime error.
program matrix_multiply
!use openacc
implicit none
integer :: i,j,k,n
real, dimension(9000,9000) :: a, b, c
real x_scalar
real x_vector(2)
n=5000
call random_number (b)
call random_number (a)
!$acc kernels
do k = 1,n
do i = 1,n
do j = 1,n
c(i,k) = c(i,k) + a(i,j) * b(j,k)
enddo
enddo
enddo
!$acc end kernels
end program matrix_multiply

Thanks to Robert Crovella
My guess is that there is some sort of display timeout on the mac (also here) As you increase to a larger size, the matrix multiply kernel takes longer. At some point the display driver timeout in the Mac OS resets the GPU. If that is the case, you could work around it by switching to a system/GPU where the GPU is not hosting a display. Both Linux and Windows (TDR) also have such timeout mechanisms.
You have to boot into >console mode in Mac OS and also disable automatic graphic switching as the console mode turns off Aqua (GUI in Mac) and thus is supposed to remove the limitation.

Related

OpenACC Declare Construct

I went through the OpenACC 2.6 supported features with PGI compilers, and encountered an issue with the memory management between CPU and GPU.
The following Fortran code is a modified version from the official document:
module data
integer, parameter :: maxl = 100000
real, dimension(maxl) :: xstat
real, dimension(:), allocatable :: yalloc
!$acc declare create(xstat,yalloc)
end module
module useit
use data
contains
subroutine compute(n)
integer :: n
integer :: i
!$acc parallel loop present(yalloc)
do i = 1, n
yalloc(i) = iprocess(i)
enddo
end subroutine
real function iprocess(i)
!$acc routine seq
integer :: i
iprocess = yalloc(i) + 2*xstat(i)
end function
end module
program main
use data
use useit
implicit none
integer :: nSize = 100
!---------------------------------------------------------------------------
call allocit(nSize)
call initialize
call compute(nSize)
!$acc update self(yalloc)
write(*,*) "yalloc(10)=",yalloc(10) ! should be 3
call finalize
contains
subroutine allocit(n)
integer :: n
allocate(yalloc(n))
end subroutine allocit
subroutine initialize
xstat = 1.0
yalloc = 1.0
!$acc enter data copyin(xstat,yalloc)
end subroutine initialize
subroutine finalize
deallocate(yalloc)
end subroutine finalize
end program main
This code can be compiled with nvfortran:
nvfortran -Minfo test.f90
and it shows the expected value on CPU:
yalloc(10)= 3.000000
However, when compiled with OpenACC:
nvfortran -add -Minfo test.f90
the code does not show the correct output:
upload CUDA data device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=initialize line=55 device=0 threadid=1 variable=.attach. bytes=8
launch CUDA kernel file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=compute line=14 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=128 grid=1 block=128
download CUDA data file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=main line=41 device=0 threadid=1 variable=yalloc bytes=400
yalloc(10)= 0.000000
I have tried to add some explicit memory movement in several places, but nothing helps. This is really confusing to me.
The problem is in your initialize routine:
subroutine initialize
xstat = 1.0
yalloc = 1.0
!acc enter data copyin(xstat,yalloc)
!$acc update device(xstat,yalloc)
end subroutine initialize
Since xstat and yalloc are already in a data region (the declare directive), the second data region ("enter data copyin") is essentially ignored (though the reference counter is updated). Instead, you need to use an update directive to synchronize the data.
With this change, the code gets the correct answers:
% nvfortran test.f90 -acc -Minfo=accel; a.out
compute:
14, Generating Tesla code
15, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
iprocess:
19, Generating acc routine seq
Generating Tesla code
main:
41, Generating update self(yalloc(:))
initialize:
56, Generating update device(yalloc(:),xstat(:))
yalloc(10)= 3.000000

SIGBUS occurs when fortran code reads file on linux cluster

When I run my fortran code in parallel on a linux cluster with mpirun I get a sigbus error.
It occurs while reading a file, the timing is irregular, and sometimes it proceeds without error.
I have tried debug compilation options like -g, but I haven't gotten any information on what line the error is coming from.
Actually the code was executed previously in three different clusters without this error, but the error is only occurring on this machine.
I personally suspect this is related to the performance of the machine (especially storage i/o), but I am not sure.
The program code is simple. Each process executed by mpirun reads the file corresponding to its rank as follows.
!!!!!!!!!! start of code
OPEN(11, FILE='FILE_NAME_WITH_RANK', FORM='UNFORMATTED')
READ(11,*) ISIZE
ALLOCATE(SOME_VARIABLE(ISIZE))
DO I = 1, ISIZE
READ(11,*) SOME_VARIABLE(I)
ENDDO
READ(11,*) ISIZE2
ALLOCATE(SOME_VARIABLE2(ISIZE2))
DO I = 1, ISIZE2
READ(11,*) SOME_VARIABLE2(I)
ENDDO
! MORE VARIABLES
CLOSE(11)
!!!!!!!!!! end of code
I used 191 cpu, and the total size of 191 files it loads is about 11 GB.
The cluster used for execution consists of 24 nodes with 16 cpu each (384 cpu total) and uses common storage that is shared with another cluster.
I ran the code in parallel by specifying nodes 1 through 12 as the hostfile.
Initially, I had 191 cpu read all files at the same time out of sequence.
After doing so, the program ended with a sigbus error. Also, for some nodes, the ssh connection was delayed, and the bashrc file cannot be found by node with stale file handle error.
The stale file handle error waited a bit and it seemed to recover by itself, but I'm not sure what the system administrator did.
So, I changed it to the following code so that only one cpu can read the file at a time.
!!!!!!!!!! start of code
DO ICPU = 0, NUMBER_OF_PROCESS-1
IF(ICPU.EQ.MY_PROCESS) CALL READ_FILE
CALL MPI_BARRIER(MPI_COMMUNICATOR,IERR)
ENDDO
!!!!!!!!!! end of code
This seemed to work fine for single execution, but if I ran more than one of these programs at the same time, the first mpirun stopped and both ended with a sigbus error eventually.
My next attempt is to minimize the execution of the read statement by deleting the do statement when reading the array. However, due to limited time, I couldn't test the effectiveness of this modification.
Here are some additional information.
If I execute a search or copy a file with an explorer such as nautilus while running a parallel program, nautilus does not respond or the running program raise sigbus. In severe cases, I wasn't able to connect the VNC server with stale file handle errors.
I use OpenMPI 2.1.1, GNU Fortran 4.9.4.
I compile the program with following
$OPENMPIHOME/bin/mpif90 -mcmodel=large -fmax-stack-var-size-64 -cpp -O3 $SOURCE -o $EXE
I execute the program with following in gnome terminal
$OPENMPIHOME/bin/mpirun -np $NP -x $LD_LIBRARY_PATH --hostfile $HOSTFILE $EXE
The cluster is said to be running commercial software like FLUENT without problems.
Summing up the above, my personal suspicion is that the storage of the cluster is dismounted due to the excessive disk I/O generated by my code, but I don't know if this makes sense because I have no cluster knowledge.
If yes, I wonder if there is a way to minimize the disk I/O, if it is enough to proceed with the vectorized I/O mentioned above, or if there is an additional part.
I would appreciate it if you could tell me anything about the problem.
Thanks in advance.
!!!
I wrote an example code. As mentioned above, it may not be easy to reproduce because the occurrence varies depending on the machine.
PROGRAM BUSWRITE
IMPLICIT NONE
INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
INTEGER :: I
INTEGER :: I1, I2, I3
CHARACTER*3 CPUNUM
INCLUDE 'mpif.h'
INTEGER ISTATUS(MPI_STATUS_SIZE)
INTEGER :: IERR, NPES, MYPE
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)
I1=MOD(MYPE/100,10)+48
I2=MOD(MYPE/10 ,10)+48
I3=MOD(MYPE ,10)+48
CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)
OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
ALLOCATE(ARRAY1(ISIZE1))
ALLOCATE(ARRAY2(ISIZE2))
ALLOCATE(ARRAY3(ISIZE3))
DO I = 1, ISIZE1
ARRAY1(I) = I
WRITE(11) ARRAY1(I)
ENDDO
DO I = 1, ISIZE2
ARRAY2(I) = I**2
WRITE(11) ARRAY2(I)
ENDDO
DO I = 1, ISIZE3
ARRAY3(I) = I**3
WRITE(11) ARRAY3(I)
ENDDO
CLOSE(11)
CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./buswrite.f90 -o ./buswrite
mpirun -np 32 ./buswrite
I've got 32 000.DAT ~ 031.DAT
PROGRAM BUSREAD
IMPLICIT NONE
INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
INTEGER :: I
INTEGER :: I1, I2, I3
CHARACTER*3 CPUNUM
INCLUDE 'mpif.h'
INTEGER ISTATUS(MPI_STATUS_SIZE)
INTEGER :: IERR, NPES, MYPE
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)
I1=MOD(MYPE/100,10)+48
I2=MOD(MYPE/10 ,10)+48
I3=MOD(MYPE ,10)+48
CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)
OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
ALLOCATE(ARRAY1(ISIZE1))
ALLOCATE(ARRAY2(ISIZE2))
ALLOCATE(ARRAY3(ISIZE3))
DO I = 1, ISIZE1
READ(11) ARRAY1(I)
IF(ARRAY1(I).NE.I) STOP
ENDDO
DO I = 1, ISIZE2
READ(11) ARRAY2(I)
IF(ARRAY2(I).NE.I**2) STOP
ENDDO
DO I = 1, ISIZE3
READ(11) ARRAY3(I)
IF(ARRAY3(I).NE.I**3) STOP
ENDDO
CLOSE(11)
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
IF(MYPE.EQ.0) WRITE(*,*) 'GOOD'
CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./busread.f90 -o ./busread
mpirun -np 32 ./busread
I've got 'GOOD' output text from terminal as expected, but the machine in question is terminated with a sigbus error while running busread.
The issue was not observed after a device reboot. Even though I ran 4 programs at the same time under the same conditions, no problem occurred. In addition, other teams that used the device also had similar problems, which were resolved after reboot. The conclusion is a bit ridiculous, but if there are any people experiencing similar problems, I would like to summarize it as follows.
If your program terminates abnormally due to a memory error (like sigbus and sigsegv) while reading or writing a file, you can check the following.
Make sure there are no errors in your program. Check whether the time of occurrence of the error is constant or irregular, whether other programs have the same symptoms, whether it runs well on other machines, and whether there is a problem when run with a memory error checking tool such as valgrind.
Optimize the file I/O part. In the case of fortran, processing an entire array is tens of times faster than processing by element.
Immediately after an error occurs, try ssh connection to the machine (or node) to check whether the connection is smooth and that the file system is well accessed. If you cannot access the bashrc file or an error such as stale file handle occurs, please contact the system manager after combining the above reviewed information.
If someone has anything to add or if this post isn't appropriate, please let me know.

Retrospectively closing a NetCDF file created with Fortran

I'm running a distributed model stripped to its bare minimum below:
integer, parameter :: &
nx = 1200,& ! Number of columns in grid
ny = 1200,& ! Number of rows in grid
nt = 6000 ! Number of timesteps
integer :: it ! Loop counter
real :: var1(nx,ny), var2(nx,ny), var3(nx,ny), etc(nx,ny)
! Create netcdf to write model output
call check( nf90_create(path="out.nc",cmode=nf90_clobber, ncid=nc_out_id) )
! Loop over time
do it = 1,nt
! Calculate a lot of variables
...
! Write some variables in out.nc at each timestep
CALL check( nf90_put_var(ncid=nc_out_id, varid=var1_varid, values=var1, &
start = (/ 1, 1, it /), count = (/ nx, ny, 1 /)) )
! Close the netcdf otherwise it is not readable:
if (it == nt) call check( nf90_close(nc_out_id) )
enddo
I'm in the development stage of the model so, it inevitably crashes at unexpected points (usually at the Calculate a lot of variables stage), which means that, if the model crashes at timestep it =3000, 2999 timesteps will be written to the netcdf output file, but I will not be able to read the file because the file has not been closed. Still, the data have been written: I currently have a 2GB out.nc file that I can't read. When I ncdump the file it shows
netcdf out.nc {
dimensions:
x = 1400 ;
y = 1200 ;
time = UNLIMITED ; // (0 currently)
variables:
float var1 (time, y, x) ;
data:
}
My questions are: (1) Is there a way to close the file retrospectively, even outside Fortran, to be able to read the data that have already been written? (2) Alternatively, is there another way to write the file in Fortran that would make the file readable even without closing it?
When nf90_close is called, buffered output is written to disk and the file ID is relinquished so it can be reused. The problem is most likely due to buffered output not having been written to the disk when the program terminates due to a crash, meaning that only the changes you made in "define mode" are present in the file (as shown by ncdump).
You therefore need to force the data to be written to the disk more often. There are three ways of doing this (as far as I am aware).
nf90_sync - which synchronises the buffered data to disk when called. This gives you the most control over when to output data (every loop step, or every n loop steps, for example), which can allow you to optimize for speed vs robustness, but introduces more programming and checking overhead for you.
Thanks to #RussF for this idea. Creating or opening the file using the nf90_share flag. This is the recommended approach if the netCDF file is intended to be used by multiple readers/writers simultaneously. It is essentially the same as an automatic implementation of nf90_sync for writing data. It gives less control, but also less programming overhead. Note that:
This only applies to netCDF-3 classic or 64-bit offset files.
Finally, an option I wouldn't recommend, but am including for completeness (and I guess there may be situations where this is the best option, although none spring to mind) - closing and reopening the file. I don't recommend this, because it will slow down your program, and adds greater possibility of causing errors.

Strange behavior while calling properties from REFPROP FORTRAN files

I am trying to use REFPROPs HSFLSH subroutine to compute properties for steam.
When the same state property is calculated over multiple iterations
(fixed enthalpy and entropy (Enthalpy = 50000 J/mol & Entropy = 125 J/mol),
the time taken to compute using HSFLSH after every 4th/5th iteration increases to about 0.15 ms against negligible amount of time for other iterations. This is turning problematic because my program places call to this subroutine over several thousand times. Thus leading to abnormally huge program run times.
The program used to generate the above log is here:
C refprop check
program time_check
parameter(ncmax=20)
dimension x(ncmax)
real hkj,skj
character hrf*3, herr*255
character*255 hf(ncmax),hfmix
C
C SETUP FOR WATER
C
nc=1 !Number of components
hf(1)='water.fld' !Fluid name
hfmix='hmx.bnc' !Mixture file name
hrf='DEF' !Reference state (DEF means default)
call setup(nc,hf,hfmix,hrf,ierr,herr)
if (ierr.ne.0) write (*,*) herr
call INFO(1,wm,ttp,tnbp,tc,pc,dc,zc,acf,dip,rgas)
write(*,*) 'Mol weight ', wm
h = 50000.0
s = 125.0
c
C
DO I=1,NCMAX
x(I) = 0
END DO
C ******************************************************
C THIS IS THE ACTUAL CALL PLACE
C ******************************************************
do I=1,100
call cpu_time(tstrt)
CALL HSFLSH(h,s,x,T_TEMP,P_TEMP,RHO_TEMP,dl,dv,xliq,xvap,
& WET_TEMP,e,
& cv,cp,VS_TEMP,ierr,herr)
call cpu_time(tstop)
write(*,*),I,' time taken to run hsflsh routine= ',tstop - tstrt
end do
stop
end
(of course you will need the FORTRAN FILES, which unfortunately I cannot share since REFPROP isn't open source)
Can someone help me figure out why is this happening.?
P.S : The above code was compiled using gfortran -fdefault-real-8
UPDATE
I tried using system_clock to time my computations as suggested by #Ross below. The results are uniform across the loop (image below). I will have to find alternate ways to improve computation speed I guess (Sigh!)
I don't have a concrete answer, but this sort of behaviour looks like what I would expect if all calls really took around 3 ms, but your call to CPU_TIME doesn't register anything below around 15 ms. Do you see any output with time taken less than, say 10 ms? Of particular interest to me is the approximately even spacing between calls that return nonzero time - it's about even at 5.
CPU timing can be a tricky business. I recommended in a comment that you try system_clock, which can be higher precision than CPU_TIME. You said it doesn't work, but I'm unconvinced. Did you pass a long integer to system_clock? What was the count_rate for your system? Were all the times still either 15 or 0 ms?

accelerator directives not working

This is the code for a matrix multiplication
program ex
implicit none
real :: a(256,256),b(256,256),c(256,256),t1,t2
integer i,j,k,sum
sum=0
do j = 1,256
do i = 1,256
a(i,j) = 1
b(i,j) = 1
c(i,j) = 0.0
enddo
enddo
call cpu_time(t1)
!$acc region do
do i=1,256
do j=1,256
sum=0
do k=1,256
sum=sum+a(i,k)*b(k,j)
c(i,j)=sum
end do
end do
end do
!$acc end region
call cpu_time(t2)
print*,"cpu time=",t2-t1
print*,c
end program ex
When I execute this the execution time is 75 msec when using the accelerator directives and the PGI compiler. But when I run same matrix multiplication with a "cuda fortran" implementation the execution time is only 5msec. So there is big difference even though I used the accelerator directives. So I doubt that my accelerator directives are working properly.
I tried to accelerate your program using very similar accelerator directives OpenHMPP. Note that I switched one your line, that is probably errorneously in the innermost loop. Also note, that I had to advice the compiler of the reduction taking place. Also I renamed the reduction variable, because it shadowed the sum intrinsic function.
The performance is not good, because of the overheead with starting the GPU kernel and because of the memory transfers. You need orders of magnitude more work for it to be profitable to use GPU.
For example when I used matrices 2000 x 2000 then the CPU execution time was 41 seconds, but GPU execution time only 8 s.
program ex
implicit none
real :: a(256,256),b(256,256),c(256,256),t1,t2
integer i,j,k,sm
sm=0
do j = 1,256
do i = 1,256
a(i,j) = 1
b(i,j) = 1
c(i,j) = 0.0
enddo
enddo
call cpu_time(t1)
!$hmpp region, target = CUDA
!$hmppcg gridify, reduce(+:sm)
do i=1,256
do j=1,256
sm=0
do k=1,256
sm=sm+a(i,k)*b(k,j)
end do
c(i,j)=sm
end do
end do
!$hmpp endregion
call cpu_time(t2)
print*,"cpu time=",t2-t1
print*,sum(c)
end program ex
edit: it would be probably not to use reduce(+:sm), but just private(sm)
FYI, the OP also posted this question on the PGI User Forum (http://www.pgroup.com/userforum/viewtopic.php?t=3081). We believe the original issue was the result of pilot error. When we profiled his code using CUDA Prof, the CUDA Fortran kernel execution time was 205 ms versus 344 ms using the PGI Accelerator Model. Also, if I fix his code so that "c(i,j)=sum" is placed outside of the inner "k" loop, the PGI Accelerator Model time reduces to 123ms. It's unclear how he gathered his timings.
Thanks to those that tried to help.
- Mat