I was working through the OpenACC 2.6 features supported by the PGI compilers and ran into an issue with memory management between the CPU and GPU.
The following Fortran code is a modified version of an example from the official documentation:
module data
  integer, parameter :: maxl = 100000
  real, dimension(maxl) :: xstat
  real, dimension(:), allocatable :: yalloc
  !$acc declare create(xstat,yalloc)
end module

module useit
  use data
contains
  subroutine compute(n)
    integer :: n
    integer :: i
    !$acc parallel loop present(yalloc)
    do i = 1, n
      yalloc(i) = iprocess(i)
    enddo
  end subroutine

  real function iprocess(i)
    !$acc routine seq
    integer :: i
    iprocess = yalloc(i) + 2*xstat(i)
  end function
end module

program main
  use data
  use useit
  implicit none
  integer :: nSize = 100
  !---------------------------------------------------------------------------
  call allocit(nSize)
  call initialize
  call compute(nSize)
  !$acc update self(yalloc)
  write(*,*) "yalloc(10)=",yalloc(10) ! should be 3
  call finalize
contains
  subroutine allocit(n)
    integer :: n
    allocate(yalloc(n))
  end subroutine allocit

  subroutine initialize
    xstat = 1.0
    yalloc = 1.0
    !$acc enter data copyin(xstat,yalloc)
  end subroutine initialize

  subroutine finalize
    deallocate(yalloc)
  end subroutine finalize
end program main
This code can be compiled with nvfortran:
nvfortran -Minfo test.f90
and it shows the expected value on CPU:
yalloc(10)= 3.000000
However, when compiled with OpenACC enabled:
nvfortran -acc -Minfo test.f90
the code does not produce the correct output:
upload CUDA data device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=initialize line=55 device=0 threadid=1 variable=.attach. bytes=8
launch CUDA kernel file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=compute line=14 device=0 threadid=1 num_gangs=1 num_workers=1 vector_length=128 grid=1 block=128
download CUDA data file=/home/yang/GPU-Collection/openacc/basics/globalArray.f90 function=main line=41 device=0 threadid=1 variable=yalloc bytes=400
yalloc(10)= 0.000000
I have tried to add some explicit memory movement in several places, but nothing helps. This is really confusing to me.
The problem is in your initialize routine:
subroutine initialize
  xstat = 1.0
  yalloc = 1.0
  ! the enter data directive is disabled here (note the missing "$"):
  !acc enter data copyin(xstat,yalloc)
  !$acc update device(xstat,yalloc)
end subroutine initialize
Since xstat and yalloc are already in a data region (created by the declare directive), the second data region ("enter data copyin") is essentially ignored: the reference counter is incremented, but no data is copied. Instead, you need an update directive to synchronize the host data to the device.
With this change, the code gets the correct answers:
% nvfortran test.f90 -acc -Minfo=accel; a.out
compute:
14, Generating Tesla code
15, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
iprocess:
19, Generating acc routine seq
Generating Tesla code
main:
41, Generating update self(yalloc(:))
initialize:
56, Generating update device(yalloc(:),xstat(:))
yalloc(10)= 3.000000
Related
I am trying to run a piece of Fortran code written in F95. I have compiled it using gfortran on Ubuntu.
The code reads in a text file. When I run it, it gives me the following error:
Fortran runtime error: Cannot open file 'input_parameters.txt': Operation not supported
This is the code up until the point that we attempt to read the text file:
program LSmodel
  implicit none ! require explicit declaration of every variable (no implicit typing)
  real :: sec,ran,gasdev ! random generator variables
  real :: x,y,z,u,v,w,ut,vt,wt,t,dt ! simulation variables
  real :: wg ! seed parameters
  real :: Um,sigma_u,sigma_v,sigma_w,uw ! wind statistics variables
  real :: dvaru_dz,dvarv_dz,dvarw_dz,duw_dz ! wind statistics variables
  real :: dissip_m,TL ! vector over the range of ustars
  real :: zs,zg,zmax ! release height & boundaries
  real :: Ainv,C0inv ! inverse parameters
  real :: C0,A,b,au,av,aw,dt_on_TL ! LS model parameters
  real :: dz_max,dt_max ! time step limit
  real :: CT,beta ! Crossing Trajectories correction
  real :: C_chi,chi,TKE,T_chi,omega ! DI parameters
  real :: a_ln,b_ln,sigma_chi,dissip_s ! DI parameters
  real :: rhop,rho,r,g,gt,Re,AIP,Cd,nu ! IP parameters
  real :: up,vp,wp,upt,vpt,wpt,vr,dt_ip,alpha ! IP parameters
  real :: keepseed, maxheight
  integer :: seed ! random generator seed; keepseed decides whether to keep the same seed across runs for comparison
  integer :: pnum, traj_exit ! simulation parameters; traj_exit counts the particles that have exited from the top of the wind flow
  integer :: i,j,jj,n,ii ! counting parameters
  integer :: n_ip,IP=1 ! IP parameters
  character(len=80) :: filename, wgchar, foldername
  real, allocatable,dimension(:) :: z_vec,Um_vec,sigma_u_vec,sigma_v_vec,sigma_w_vec,uw_vec
  real, allocatable,dimension(:) :: dvaru_dz_vec,dvarv_dz_vec,dvarw_dz_vec,duw_dz_vec,dissip_m_vec
  ! input
  open (23,file='input_parameters.txt') ! opening a file for the input parameters
  read (23,*) x,C0,wg,zs,zg,beta,dt_on_TL,y,sigma_chi,C_chi,r,rhop,alpha,rho,nu,keepseed,foldername
  close(23)
I am running Ubuntu 18.04.2 LTS.
An update: I believe I have found the reason this code was not working, although I don't know why it matters.
The folder was on a network drive, not on my local computer. Once I moved the folder onto my local machine, I stopped getting this error.
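Incidentally, failures like this are easier to diagnose when the open statement checks its own status instead of letting the runtime abort. A minimal sketch using the standard iostat and iomsg specifiers (variable names here are illustrative):

  integer :: ios
  character(len=256) :: msg

  ! request the status and message instead of crashing on failure
  open (23, file='input_parameters.txt', status='old', action='read', &
        iostat=ios, iomsg=msg)
  if (ios /= 0) then
    write (*,*) 'Could not open input_parameters.txt: ', trim(msg)
    stop 1
  end if

With that in place, the program reports the operating system's message (here, "Operation not supported") and stops cleanly instead of crashing.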
While developing a Fortran program that employs an iterative procedure, I needed a way to stop the iterations manually (that is, to exit the iteration loop without terminating the program).
I decided to do this by sending a signal to the process. I chose SIGALRM and checked that it can be trapped without any unexpected consequences.
When the signal is received, a flag value is changed. This flag is checked inside the iteration loop, which exits if the flag is true. A sample of such code is given below.
!file mymod.f90
module mymod
  use ifport
  integer*4 :: err
  integer*4 :: SIGNSET
  integer*4, parameter :: mySignal=14
  logical*1 :: toStopIteration
contains
  ! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ !
  integer*4 function setTrap() result(ret)
    implicit none
    call PXFSTRUCTCREATE('sigset',SIGNSET,err)
    call PXFSIGADDSET(SIGNSET,mySignal,err) ! add my signal to the set
    ret=0; return
  end function setTrap
  ! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ !
  integer*4 function onTrap(sig_num) result(rcode)
    implicit none
    integer*4 :: sig_num,err
    rcode=0
    select case (sig_num)
      case(mySignal)
        write (*,*) 'Signal occurred. Stop iteration called'
        write (*,*) 'flag: ',toStopIteration
        toStopIteration=.true.
        write (*,*) 'flag: ',toStopIteration
        rcode=1
        return
      case (SIGINT) ; stop
      case (SIGTERM); stop
      case (SIGABRT); stop
    end select
  end function onTrap
  ! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ !
end module mymod
!file main.f90
program main
  use mymod
  implicit none
  integer*4 :: i,j,N,Niters,sum1
  err=setTrap()
  err=signal(mySignal, onTrap, -1)
  toStopIteration=.false.
  open (1,file='output')
  write (*,*) 'PID=',getpid()
  write (1,*) 'Outside',toStopIteration
  N=5000000; Niters=100000
  do i = 1,Niters
    if (toStopIteration) then
      toStopIteration=.false.
      exit
    endif
    sum1=0
    do j = 1,N
      sum1=sum1+j
    enddo
    write (1,*) i,toStopIteration,sum1
  enddo
  write (*,*) 'Procedure was terminated due to signal received. The last iteration was', i
  write (*,*) 'Now I will do other job for you.'
  stop
end program main
The application was compiled with ifort: ifort -c -O2 -traceback.
When I send a signal to the process with kill -14 pid,
I get this output in the terminal:
Signal occurred. Stop iteration called
flag: F
flag: T
But the iteration loop is still running, and as written to the file, the variable "toStopIteration" remains false.
By accident, I found out that when compiled with -O0 -traceback, it works fine.
Why does this happen? Does the variable "toStopIteration" become local at that optimization level? And what can I do to make it work correctly?
Thanks in advance.
MuKeP.
As Lorri answered (unfortunately that succinct, but correct, answer has been deleted by misdirected reviews): try the volatile attribute on toStopIteration. This tells the compiler that the variable might be redefined by something else; otherwise, from the source visible to the compiler, it looks as if the value of that variable cannot change during an iteration, so there is no point testing it each time around the loop.
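A minimal sketch of the change in mymod.f90, touching only the declaration (the rest of the module stays the same):

  ! volatile: the flag may be changed asynchronously by the signal handler,
  ! so the compiler must reload it from memory on every test
  logical*1, volatile :: toStopIteration

Without volatile, at -O2 the compiler is free to hoist the load of toStopIteration out of the loop and test a stale register copy, which matches the behavior observed above.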
I have some HDF data that has been created with PyTables. This data is very large, an array of 3973850000 x 8 double precision values, but with PyTables compression this can easily be stored.
I want to access this data using Fortran, so I do:
PROGRAM HDF_READ
  USE HDF5
  IMPLICIT NONE
  CHARACTER(LEN=100), PARAMETER :: filename = 'example.h5'
  CHARACTER(LEN=100), PARAMETER :: dsetname = 'example_dset.h5'
  INTEGER :: error
  INTEGER(HID_T) :: file_id
  INTEGER(HID_T) :: dset_id
  INTEGER(HID_T) :: space_id
  INTEGER(HSIZE_T), DIMENSION(2) :: data_dims, max_dims
  DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: dset_data

  !Initialize Fortran interface
  CALL h5open_f(error)
  !Open an existing file
  CALL h5fopen_f(filename, H5F_ACC_RDONLY_F, file_id, error)
  !Open a dataset
  CALL h5dopen_f(file_id, dsetname, dset_id, error)
  !Get dataspace ID
  CALL h5dget_space_f(dset_id, space_id, error)
  !Get dataspace dims
  CALL h5sget_simple_extent_dims_f(space_id, data_dims, max_dims, error)
  !Create array to read into
  ALLOCATE(dset_data(data_dims(1), data_dims(2)))
  !Get the data
  CALL h5dread_f(dset_id, H5T_NATIVE_DOUBLE, dset_data, data_dims, error)
END PROGRAM HDF_READ
However, this creates an obvious problem: an array of that size in double precision cannot be allocated, as it is larger than the system memory.
What is the best method for accessing this data? My current thought is some sort of chunking method. Or is there a way to keep the array on disk? Does HDF5 have methods for dealing with data this large? I have read around but can find nothing pertaining to my case.
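HDF5 does support reading a dataset piecewise through hyperslab selections, so only a bounded buffer is ever resident in memory. Below is a minimal sketch, not a tested implementation: it assumes the long axis appears as the second Fortran dimension (PyTables writes in C order, so a 3973850000 x 8 array in Python typically appears as 8 x 3973850000 in Fortran), and rows_per_slab is an arbitrary placeholder for however many columns fit in RAM.

PROGRAM HDF_READ_SLABS
  USE HDF5
  IMPLICIT NONE
  CHARACTER(LEN=100), PARAMETER :: filename = 'example.h5'
  CHARACTER(LEN=100), PARAMETER :: dsetname = 'example_dset.h5'
  INTEGER :: error
  INTEGER(HID_T) :: file_id, dset_id, filespace, memspace
  INTEGER(HSIZE_T), DIMENSION(2) :: data_dims, max_dims, offset, slab_dims
  INTEGER(HSIZE_T) :: remaining
  INTEGER(HSIZE_T), PARAMETER :: rows_per_slab = 1000000 ! placeholder slab size
  DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: buffer

  CALL h5open_f(error)
  CALL h5fopen_f(filename, H5F_ACC_RDONLY_F, file_id, error)
  CALL h5dopen_f(file_id, dsetname, dset_id, error)
  CALL h5dget_space_f(dset_id, filespace, error)
  CALL h5sget_simple_extent_dims_f(filespace, data_dims, max_dims, error)

  ALLOCATE(buffer(data_dims(1), rows_per_slab))
  offset = (/ 0_HSIZE_T, 0_HSIZE_T /)
  remaining = data_dims(2)
  DO WHILE (remaining > 0)
    slab_dims = (/ data_dims(1), MIN(rows_per_slab, remaining) /)
    ! Select the next slab in the file...
    CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, slab_dims, error)
    ! ...and a matching dataspace describing the in-memory buffer
    CALL h5screate_simple_f(2, slab_dims, memspace, error)
    CALL h5dread_f(dset_id, H5T_NATIVE_DOUBLE, buffer, slab_dims, error, &
                   mem_space_id=memspace, file_space_id=filespace)
    ! ... process buffer(:, 1:slab_dims(2)) here ...
    CALL h5sclose_f(memspace, error)
    offset(2) = offset(2) + slab_dims(2)
    remaining = remaining - slab_dims(2)
  END DO

  CALL h5sclose_f(filespace, error)
  CALL h5dclose_f(dset_id, error)
  CALL h5fclose_f(file_id, error)
  CALL h5close_f(error)
END PROGRAM HDF_READ_SLABS

Each pass keeps at most data_dims(1) * rows_per_slab doubles resident, independent of the total dataset size.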
Is the following code correct? I have a 2 GB GeForce 750M and am using the PGI Fortran compiler. The program works fine for 4000x4000 arrays; anything larger and it complains even though it should not. As you can see, I have allocated a 9000x9000 array, but if I use an n value > 4000 it throws a runtime error.
program matrix_multiply
  !use openacc
  implicit none
  integer :: i,j,k,n
  real, dimension(9000,9000) :: a, b, c
  real x_scalar
  real x_vector(2)
  n=5000
  call random_number (b)
  call random_number (a)
  c = 0.0 ! initialize the accumulator; it is read before being written below
  !$acc kernels
  do k = 1,n
    do i = 1,n
      do j = 1,n
        c(i,k) = c(i,k) + a(i,j) * b(j,k)
      enddo
    enddo
  enddo
  !$acc end kernels
end program matrix_multiply
Thanks to Robert Crovella
My guess is that there is some sort of display timeout on the Mac. As you increase to a larger size, the matrix multiply kernel takes longer, and at some point the display driver timeout in Mac OS resets the GPU. If that is the case, you could work around it by switching to a system/GPU where the GPU is not hosting a display. Both Linux and Windows (TDR) have similar timeout mechanisms.
You have to boot into >console mode in Mac OS and also disable automatic graphics switching, as console mode turns off Aqua (the GUI in Mac) and is thus supposed to remove the limitation.
This is the code for a matrix multiplication
program ex
  implicit none
  real :: a(256,256),b(256,256),c(256,256),t1,t2
  integer i,j,k,sum
  sum=0
  do j = 1,256
    do i = 1,256
      a(i,j) = 1
      b(i,j) = 1
      c(i,j) = 0.0
    enddo
  enddo
  call cpu_time(t1)
  !$acc region do
  do i=1,256
    do j=1,256
      sum=0
      do k=1,256
        sum=sum+a(i,k)*b(k,j)
        c(i,j)=sum
      end do
    end do
  end do
  !$acc end region
  call cpu_time(t2)
  print*,"cpu time=",t2-t1
  print*,c
end program ex
When I execute this, the execution time is 75 ms using the accelerator directives and the PGI compiler. But when I run the same matrix multiplication with a CUDA Fortran implementation, the execution time is only 5 ms. That is a big difference even though I used the accelerator directives, so I doubt that my directives are working properly.
I tried to accelerate your program using the very similar OpenHMPP accelerator directives. Note that I moved one of your lines, which probably sat erroneously in the innermost loop. Also note that I had to advise the compiler of the reduction taking place, and I renamed the reduction variable because it shadowed the sum intrinsic function.
The performance is not good because of the overhead of starting the GPU kernel and because of the memory transfers. You need orders of magnitude more work for GPU use to be profitable.
For example, when I used 2000 x 2000 matrices, the CPU execution time was 41 seconds but the GPU execution time was only 8 seconds.
program ex
  implicit none
  real :: a(256,256),b(256,256),c(256,256),t1,t2
  integer i,j,k,sm
  sm=0
  do j = 1,256
    do i = 1,256
      a(i,j) = 1
      b(i,j) = 1
      c(i,j) = 0.0
    enddo
  enddo
  call cpu_time(t1)
  !$hmpp region, target = CUDA
  !$hmppcg gridify, reduce(+:sm)
  do i=1,256
    do j=1,256
      sm=0
      do k=1,256
        sm=sm+a(i,k)*b(k,j)
      end do
      c(i,j)=sm
    end do
  end do
  !$hmpp endregion
  call cpu_time(t2)
  print*,"cpu time=",t2-t1
  print*,sum(c)
end program ex
Edit: it would probably be better not to use reduce(+:sm) but just private(sm), since sm is reset for every (i,j) element rather than accumulated across the gridified loops.
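A side note on the timings: cpu_time reports processor time, which does not necessarily reflect elapsed time around offloaded kernels and data transfers. A small sketch of wall-clock timing with the standard system_clock intrinsic (illustrative only):

  integer :: c0, c1, crate
  real :: wall
  call system_clock(c0, crate)   ! counts and counts-per-second
  ! ... region being timed ...
  call system_clock(c1)
  wall = real(c1 - c0) / real(crate)
  print *, 'wall time (s) =', wall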
FYI, the OP also posted this question on the PGI User Forum (http://www.pgroup.com/userforum/viewtopic.php?t=3081). We believe the original issue was the result of pilot error. When we profiled his code using CUDA Prof, the CUDA Fortran kernel execution time was 205 ms versus 344 ms using the PGI Accelerator Model. Also, if I fix his code so that "c(i,j)=sum" is placed outside of the inner "k" loop, the PGI Accelerator Model time reduces to 123ms. It's unclear how he gathered his timings.
Thanks to those that tried to help.
- Mat