A problem in calling several GPU subroutines sequentially: OpenACC - Fortran

I have the following problem. I have a main subroutine, let us call it main_function (for 3D BSplines). It takes as input several tensors.
This function contains only IF-conditions. If a condition is satisfied, other functions are called. Let us call these functions: function_a, function_b, and function_c which are parallelizable.
The structure is as follows
subroutine main_function(paras)
if (1) then
call function_a
else if (2) then
call function_b
else if (3) then
call function_c
end if
end subroutine main_function
with
subroutine function_a(paras)
!$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_a
subroutine function_b(paras)
!$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_b
subroutine function_c(paras)
!$acc parallel loop present(....)
do
heavy parallel calcs
end do
output: eta
end subroutine function_c
The subroutines function_a, function_b, and function_c have a B-spline tensor (eta) as an output calculated on GPU. I don't want to move this tensor to the host since it is not needed there. However, after calculating eta on GPU using main_function, an interpolation subroutine interpolate3D is called to interpolate the function. The definition of interpolate3D is something like
subroutine interpolate3D(eta, x, y, z, fAtxyz)
!$acc routine seq
interpolate ...
end subroutine interpolate3D
To summarize, the pseudo-code is something like
call main_function(paras)
!$acc parallel loop present(x, y, eta, fAtxyz)
do i = 1, N
call interpolate3D(eta, x(i), y(i), z(i), fAtxyz(i))
end do
My problems and questions are:
1)- When I don't use '!$acc update self (eta)' before the loop, the results are completely wrong. Does this mean that the 'present' clause doesn't correctly find eta, as calculated by main_function, on the GPU, so that one needs to update the host and then copy it back to the GPU?
2)- How can I ensure that interpolate3D is running on the GPU? For example, if I don't have the above loop, does adding '!$acc routine seq' alone ensure that it runs on the GPU and looks for the different quantities there?
3)- In fact, when there is no loop, adding '!$acc update self (eta)' is required to get correct results. Does this mean that in this case the subroutine is executed on the CPU?
4)- To summarize, if I have two subroutines, the first of which chooses between different subroutines based on if-conditions to calculate a vector or tensor and keeps it on the GPU (I don't want to update the host), while the second uses this vector to perform some calculations on the GPU, how do I do this correctly with OpenACC?
Sorry for being long and thank you very much for your help,
In fact, I have tried different strategies. However, all of them require copying eta to the host before interpolating, even though it is only calculated on the device. There is something I don't understand, since I'm also new to OpenACC.

Cross-posted on NVIDIA's Forum: https://forums.developer.nvidia.com/t/b-splines-on-gpus-openacc-fortran/233053
The issue was an error in the user's code: a "parallel loop" directive was missing, hence the loop was being run on the host rather than on the device.
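A minimal sketch of the pattern asked about in the question, assuming a 1D array instead of the B-spline tensor. The names fill_eta and lookup are hypothetical stand-ins for function_a and interpolate3D; they are not the poster's code. The array eta is created on the device by a data region, filled there, and consumed there, with no '!$acc update self' in between. If the '!$acc parallel loop' in the main program were omitted, the loop would run on the host and read the stale host copy of eta, which matches the symptom described above.

```fortran
module spline_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical stand-in for function_a/b/c: fills eta on the device.
  subroutine fill_eta(eta, n)
    integer, intent(in) :: n
    real(dp), intent(out) :: eta(n)
    integer :: i
    !$acc parallel loop present(eta)
    do i = 1, n
      eta(i) = 2.0_dp * real(i, dp)
    end do
  end subroutine fill_eta

  ! Hypothetical stand-in for interpolate3D: callable from device code.
  subroutine lookup(eta, n, i, f)
    !$acc routine seq
    integer, intent(in) :: n, i
    real(dp), intent(in) :: eta(n)
    real(dp), intent(out) :: f
    f = eta(i) + 1.0_dp
  end subroutine lookup
end module spline_sketch

program keep_on_device
  use spline_sketch
  implicit none
  integer, parameter :: n = 1000
  real(dp) :: eta(n), f(n)
  integer :: i

  ! eta lives only on the device; f is copied back when the region ends.
  !$acc data create(eta) copyout(f)
  call fill_eta(eta, n)
  ! Without this directive the loop would run on the host.
  !$acc parallel loop present(eta, f)
  do i = 1, n
    call lookup(eta, n, i, f(i))
  end do
  !$acc end data

  print *, f(1), f(n)
end program keep_on_device
```

The 'routine seq' directive only makes a device version of the subroutine available; the subroutine runs on the GPU only when it is called from inside a compute region such as the parallel loop above.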

Related

Fortran Openmp: Ghost threads interrupt application upon allocate

I am experiencing an issue which starts to drive me crazy. As part of a large software tool I am writing a Fortran module to perform astrodynamic computations.
The main functionality is contained in a subroutine which uses OpenMP and other subroutines to perform its tasks.
The workflow is as follows:
1) Preparations - read in files
2) Start the first omp parallel do loop
3) Based on the working mode, the function might call itself recursively and process a separate part of the data (thereby executing a second omp do loop)
4) Combine the results into one large array of a derived type
Up to this point everything has been running as expected. Next is the part that fails:
5) Using the results from 4, start another parallel loop to continue the processing.
Step 5 sometimes works and sometimes it stops with an access violation. This is independent of the optimization level selected (nominally we use O2, but it also happens with O0).
Hence I fired up Intel Inspector and looked for data races, deadlocks, etc.
It reported 5 very strange data races, which make no sense to me, as the code locations are either at the definition/end of a subroutine, a line of code reading a threadprivate global variable, an Intel MKL routine, or a location reading a local allocatable array in a subroutine that is called within the parallel region.
After reading up on things a bit, I enabled the recursive switch to force the local array to end up on the stack. Another Inspector analysis was interrupted by the same access violation. When loading the results no errors were detected, but the tool warns that data may have been lost due to the abnormal end.
I then tried to put the entire loop into an OMP critical, just as a test. Now the access violation disappeared, but the code gets stuck very early. After processing 15/500 objects it just stops. This looked like a deadlock.
So I continued the search and commented out every OMP statement in the do loop of step 5.
Interestingly the code stops as well at iteration 15!
Hence I used the debugger to locate the call where the infinite waiting state occurs.
It is at the final allocate statement:
Subroutine doWork(t0, Pert, allAtmosData, considerArray, perigeeFirstTimeStepUncertainties)
real(dp), intent(in) :: t0
type (tPert), intent(in) :: Pert
type (tThermosphereData), dimension(:), allocatable, intent(in) :: allAtmosData
integer(i4), dimension(3), intent(in) :: considerArray
real(dp), dimension(:,:), allocatable, intent(out) :: perigeeFirstTimeStepUncertainties
!Locals:
real(dp), dimension(:), allocatable :: solarFluxPerigeeUnc, magneticIndexUnc, modelUnc
integer(i4) :: dataPoints
!Allocate memory for the computations
dataPoints = size(allAtmosData, 1)
allocate(solarFluxPerigeeUnc(0:dataPoints-1), source=0.0_dp)
allocate(magneticIndexUnc(0:dataPoints-1), source=0.0_dp)
allocate(modelUnc(0:dataPoints-1), source=0.0_dp)
... Subroutine body ...
!Assign outputs
allocate(perigeeFirstTimeStepUncertainties(3, 0:dataPoints-1), source=0.0_dp)
perigeeFirstTimeStepUncertainties(MGN_SOLAR_FLUX_UNCERTAINTY,:) = solarFluxPerigeeUnc
perigeeFirstTimeStepUncertainties(MGN_MAG_INDEX_UNCERTAINTY,:) = magneticIndexUnc
perigeeFirstTimeStepUncertainties(MGN_MODEL_UNCERTAINTY,:) = modelUnc
End Subroutine
The subroutine works perfectly in other places of the overall program and also for the first 14 iterations of step 5. In the 15th iteration, however, it gets stuck at the last allocate or, as I have just managed to reproduce, results in an access violation.
When I pause the code using the debugger, I see that somehow the OpenMP library has been loaded and the code hangs/crashes in frontend.cpp, which I have no access to.
What is happening here? How can a simple allocate suddenly cause OMP activity in a loop where every OMP statement has been commented out?
As a sidenote: If I comment out the allocate and pass the size as an extra argument, the same error happens at the next allocate in the code.
Any help or workaround is highly appreciated! I already googled how to shut down any prior thread pool, so that there is definitely no OMP ghost thread activity left, but apparently there is no way to do so. Also, ultimately the loop shall work in parallel.
EDIT: I just did a second test on the machine with 2018 v5. It also shows that the allocate statement results in a call to for_alloc_allocatable() and then the OMP statements. The debugger also shows the mov statement where the program crashes.
I guess that something is probably causing memory corruption in the second recursive OMP loop, as somehow the allocate is redirected to a thread pool memory allocation, although all threads but the main one should be passive. I cannot read assembler, but I was surprised to find that TBB is used by OpenMP. Could this be a bug in TBB?

Bad performance of parallel subroutine

I was trying to parallelize the following code; however, when it was executed on the main program, there didn't seem to be significant speed-up. I tested the same subroutine on another program, and it took even longer time to run than the serial code.
SUBROUTINE rotate(r,qt,n,np,i,a,b)
IMPLICIT NONE
INTEGER n,np,i
DOUBLE PRECISION a,b,r(np,np),qt(np,np)
INTEGER j
DOUBLE PRECISION c,fact,s,w,y
if(a.eq.0.d0)then
c=0.d0
s=sign(1.d0,b)
else if(abs(a).gt.abs(b))then
fact=b/a
c=sign(1.d0/sqrt(1.d0+fact**2),a)
s=fact*c
else
fact=a/b
s=sign(1.d0/sqrt(1.d0+fact**2),b)
c=fact*s
endif
!$omp parallel shared(i,n,c,s,r,qt) private(y,w,j)
!$omp do schedule(static,2)
do 11 j=i,n
y=r(i,j)
w=r(i+1,j)
r(i,j)=c*y-s*w
r(i+1,j)=s*y+c*w
11 continue
!$omp do schedule(static,2)
do 12 j=1,n
y=qt(i,j)
w=qt(i+1,j)
qt(i,j)=c*y-s*w
qt(i+1,j)=s*y+c*w
12 continue
!$omp end parallel
return
END
C (C) Copr. 1986-92 Numerical Recipes Software Vs94z&):9+X%1j49#:`*.
However, when I used the built-in function in Linux to measure the time, I got:
real 0m12.160s
user 4m49.894s
sys 0m0.880s
which is ridiculous compared to the time of the serial code:
real 0m2.078s
user 0m2.068s
sys 0m0.000s
So you have something like
do i=1,n
do j=1,n
do k=1,n
call rotate()
end do
end do
end do
for n = 100 and you are parallelizing two simple loops inside rotate.
That is hopeless. If you want decent performance, you must parallelize the outermost loop that can be parallelized.
There is simply not enough work in the loops inside rotate, and it is called too many times. You call it 1000000 times (100^3 for n = 100), and each call contains two worksharing loops, so the threads must be synchronised or re-launched 2000000 times. That takes all of your run time. All the run time increase you see is this synchronization.
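A minimal sketch of the advice above, under the assumption that the outer iterations are independent of each other (which has to be checked for the real algorithm, since successive rotations on the same matrix generally depend on one another). The routine rotate_rows and the array layout are hypothetical; the point is only that the parallel region is entered once and each thread then runs the cheap inner loop serially.

```fortran
program outer_parallel
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nblocks = 200, n = 64
  real(dp), allocatable :: r(:,:,:)
  real(dp) :: c, s
  integer :: k

  ! Hypothetical data: nblocks independent pairs of rows.
  allocate(r(2, n, nblocks))
  r(1,:,:) = 1.0_dp
  r(2,:,:) = 0.0_dp
  c = 0.0_dp   ! rotation by 90 degrees
  s = 1.0_dp

  ! One parallel region and one implicit barrier for the whole job,
  ! instead of one per call to the rotation routine.
  !$omp parallel do shared(r, c, s) private(k)
  do k = 1, nblocks
    call rotate_rows(r(:,:,k), n, c, s)
  end do
  !$omp end parallel do

  print *, r(1,1,1), r(2,n,nblocks)

contains

  ! Serial inner kernel: too little work to be worth threading itself.
  subroutine rotate_rows(a, n, c, s)
    integer, intent(in) :: n
    real(dp), intent(inout) :: a(2, n)
    real(dp), intent(in) :: c, s
    real(dp) :: y, w
    integer :: j
    do j = 1, n
      y = a(1,j)
      w = a(2,j)
      a(1,j) = c*y - s*w
      a(2,j) = s*y + c*w
    end do
  end subroutine rotate_rows
end program outer_parallel
```

Compiled without OpenMP, the directives are plain comments and the program gives the same result serially, which makes it easy to compare timings.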

Calling subroutine in parallel environment

I think my problem is related or even identical to the problem described here. But I don't understand what's actually happening.
I'm using openMP with the gfortran compiler and I have the following task to do: I have a density distribution F(X, Y) on a two-dimensional surface with x-coordinates X and y-coordinates Y. The matrix F has the size Nx x Ny.
I now have a set of coordinates Xp(i) and Yp(i) and I need to interpolate the density F onto these points. This problem is made for parallelization.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
! Some stuff to be done here
Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
Everything is shared except for i. The function interp2d is doing some simple linear interpolation.
That works fine with one thread but fails with multithreading. I traced the problem down to the hunt-subroutine taken from Numerical Recipes, which gets called by interp2d. The hunt-subroutine basically calculates the index ix such that X(ix) <= Xp(i) < X(ix+1). This is needed to get the starting point for the interpolation.
With multithreading it happens every now and then that one thread gets the correct index ix from hunt and the thread that calls hunt next gets the exact same index, even though Xp(i) is not even close to that point.
I can prevent this by using the CRITICAL environment:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
! Some stuff to be done here
!$OMP CRITICAL
Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
!$OMP END CRITICAL
! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
But this decreases the efficiency. If I use for example three threads, I have a load average of 1.5 with the CRITICAL environment. Without I have a load average of 2.75, but wrong results and even sometimes a SIGSEGV runtime error.
What exactly is happening here? It seems to me that all the threads are calling the same hunt-subroutine and if they do it at the same time there is a conflict. Does that make sense?
How can I prevent this?
Combining variable declaration and initialisation in Fortran 90+ has the side effect of giving the variable the SAVE attribute.
integer :: i = 0
is roughly equivalent to:
integer, save :: i
if (first_invocation) then
i = 0
end if
SAVE'd variables retain their value between multiple invocations of the routine and are therefore often implemented as static variables. By the rules governing the implicit data sharing classes in OpenMP, such variables are shared unless listed in a threadprivate directive.
OpenMP mandates that compliant compilers should apply the above semantics even when the underlying language is Fortran 77.
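The implicit SAVE can be seen in a small serial sketch. The function names below are made up for illustration and have nothing to do with hunt itself. The first function initialises its local variable in the declaration, so the value survives between calls; the second assigns it in an executable statement, so every call starts fresh. Inside an OpenMP parallel region the first variant would be a single variable shared by all threads, which is the kind of race described in the question.

```fortran
module counters
  implicit none
contains
  ! Initialisation in the declaration implies SAVE: one static copy.
  function count_saved() result(n)
    integer :: n
    integer :: calls = 0
    calls = calls + 1
    n = calls
  end function count_saved

  ! Ordinary local variable, set at run time on every call.
  function count_local() result(n)
    integer :: n
    integer :: calls
    calls = 0
    calls = calls + 1
    n = calls
  end function count_local
end module counters

program save_demo
  use counters
  implicit none
  integer :: i, n_saved, n_local

  do i = 1, 3
    n_saved = count_saved()
    n_local = count_local()
  end do

  ! After three calls: the saved counter has reached 3, the local one is 1.
  print *, 'saved:', n_saved, ' local:', n_local
end program save_demo
```

Moving the initialisation out of the declaration, as in count_local, is usually the simplest way to make such a routine safe to call from several threads.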

programming issue with openmp

I am having issues with openmp, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the openmp code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
No issues with compiling, however when I run the program, there are two major issues compared to the result of serial code:
The program runs even slower than the serial code (which is supposed to do matrix multiplications (matmul) in the do-loop).
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it).
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you may want to specify the number of threads your program is to use; otherwise the runtime picks an implementation-defined default. You can do so by using the environment variable OMP_NUM_THREADS, e.g. calling your program by means of
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime with omp_set_num_threads.
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
prelimRes = myFunction(i)
res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. use OpenBLAS), your results may indeed vary (variations should be smaller than 1e-8 with regard to double precision variables) due to the different order of operations in parallel processing.
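A small sketch of why the order of operations matters, assuming ordinary IEEE 754 double precision arithmetic. The values are chosen for illustration only: the same three numbers are summed with two different groupings, as could happen when partial sums are formed by different threads.

```fortran
program sum_order
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: a, b, c, left, right

  a = 1.0e16_dp
  b = -1.0e16_dp
  c = 1.0_dp

  ! Floating-point addition is not associative: near 1e16 the spacing
  ! between doubles is 2, so adding 1 to b is lost in rounding.
  left  = (a + b) + c
  right = a + (b + c)

  print *, 'left  =', left
  print *, 'right =', right
  print *, 'equal?', left == right
end program sum_order
```

Differences of this kind are normally tiny relative to the magnitude of the operands, so a check against the serial result should use a tolerance rather than exact equality.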
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. use as many threads as there are CPUs, you can do so by using (just like you stated in your question):
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...

Thread issues when writing to files with OpenMP in Fortran

The number of files that get written is always less than the number of threads. With 4 threads and the CPU working at 400%, I expected the number of files to be 4 (one for each thread). I don't know if there is a problem with my code or if this is how it is supposed to work. The code is as follows:
!!!!!!!! module
module common
use iso_fortran_env
implicit none
integer,parameter:: dp=real64
real(dp):: aa,bb
contains
subroutine evolve(y,yevl)
implicit none
integer(dp),parameter:: id=2
real(dp),intent(in):: y(id)
real(dp),intent(out):: yevl(id)
yevl(1)=y(2)+1.d0-aa*y(1)**2
yevl(2)=bb*y(1)
end subroutine evolve
end module common
use common
implicit none
integer(dp):: iii,iter,i
integer(dp),parameter:: id=2
real(dp),allocatable:: y(:),yt(:)
integer(dp):: OMP_GET_THREAD_NUM, IXD
allocate(y(id)); allocate(yt(id)); y=0.d0; yt=0.d0; bb=0.3d0
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
!$OMP DO
do iii=1,20000; print*,iii !! EXPECTED THREADS TO BE OF 5000 ITERATIONS EACH
aa=1.d0+dfloat(iii-1)*0.4d0/80000.d0
loop1: do iter=1,10 !! THE INITIAL CONDITION LOOP
call random_number(y)!! RANDOM INITIALIZATION OF THE VARIABLE
loop2: do i=1,70000 !! ITERATION OF THE SYSTEM
call evolve(y,yt)
y=yt
enddo loop2 !! END OF SYSTEM ITERATION
write(IXD+1,*)aa,yt !!! WRITING FILE CORRESPONDING TO EACH THREAD
enddo loop1 !!INITIAL CONDITION ITERATION DONE
enddo
!$OMP ENDDO
!$OMP END PARALLEL
end
Is this behavior resulting from some race issue in the code? The code compiles and executes just fine without any warnings or errors with ifort version 13.1.0 on ubuntu. Thanks a bunch for any comments or suggestions.
The variable IXD should be explicitly declared as private to make sure every thread has its own copy of it. Changing the line(s)
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
to
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt,ixd) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
solves the problem.