Parallelizing DO loop with nvfortran on gpu - fortran

I am trying to parallelize a do loop in Fortran. Using OpenMP parallel do (and after converting it to a do concurrent loop) it works nicely with both gfortran and nvfortran, but when I compile it with nvfortran -stdpar=gpu it compiles fine, yet crashes at runtime with:
0 Current file: xxx/pi.f90
function: pi
line: 15 This file was compiled: -acc=gpu -gpu=cc35 -gpu=cc50 -gpu=cc60 -gpu=cc60 -gpu=cc70 -gpu=cc75 -gpu=cc80 -
Here is the code:
program pi
  implicit none
  integer :: count, n, i
  real :: r
  real, dimension(10000) :: x, y
  logical, dimension(10000) :: c
  c = .false.
  n = size(x, 1)
  print *, count(c)
  call RANDOM_SEED
  call random_number(x)
  call random_number(y)
  do concurrent (i = 1:n)
    if (x(i)**2 + y(i)**2 < 1.0) c(i) = .true.
  end do
  r = 4 * real(count(c)) / n
  print *, r
end program pi

Related

MPI_WTIME is not giving me speedup as required

Program Main
  implicit none
  include 'mpif.h'
  ! Define parameters
  integer :: my_rank, p2, n2, ierr, source
  integer, parameter :: n=3, m=3, o=m*n
  real(kind=8) :: aaa(n), ddd(n), bbb(n), ccc(n), xxx(n), b(m,n), start, finish
  integer :: i, j
  real :: h
  real(kind=8), dimension(:), allocatable :: sol1
  h = 0.25
  b = 0
  do i = 1, m
    b(i,i) = 1/(1.2**i)
    if (i > 1) b(i,i-1) = -b(i,i)   ! guard: column 0 does not exist
  enddo
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, p2, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
  allocate(sol1(o))
  start = MPI_WTIME()
  do i = 1, n
    aaa(i) = -1/h**2
    bbb(i) = 2/h**2 + b(my_rank+1, my_rank+1)
    ccc(i) = -1/h**2
    ddd(i) = 1/h**2
  enddo
  call thomas(aaa, bbb, ccc, ddd, xxx, n)
  finish = MPI_WTIME()
  print *, finish - start
  write(*,*) xxx, my_rank
  ! send and receive types must match for a real(kind=8) buffer
  call MPI_GATHER(xxx, n, MPI_REAL8, sol1, n, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
  print *, sol1
  call MPI_FINALIZE(ierr)
end program main

subroutine thomas(ld, md, ud, rh, solution, n)
  ! Thomas algorithm (tridiagonal solver): forward sweep, then back substitution
  implicit none
  integer, parameter :: r8 = kind(1.d0)
  integer, intent(in) :: n
  real(r8), dimension(n), intent(in) :: ld, md, ud, rh
  real(r8), dimension(n), intent(out) :: solution
  real(r8), dimension(n) :: P, Q
  real(r8) :: m
  integer :: i
  P(1) = ud(1)/md(1)
  Q(1) = rh(1)/md(1)
  do i = 2, n
    m = md(i) - P(i-1)*ld(i)
    P(i) = ud(i)/m
    Q(i) = (rh(i) - Q(i-1)*ld(i))/m
  end do
  solution(n) = Q(n)
  do i = n-1, 1, -1
    solution(i) = Q(i) - P(i)*solution(i+1)
  end do
end subroutine thomas
Here I used MPI_WTIME() to measure the execution time. It seems that when I increase the number of processors I am not getting any speedup. In this code I have m=3 (I make m equal to the number of processors) and run with mpirun -np 3 sp.exe. If I now change to, say, m=10 and run with mpirun -np 10 sp.exe, I should get a shorter time, shouldn't I? Or am I missing something here? The community helped me before with some issues and now I am getting another one; I would really appreciate it if somebody could point something out. Isn't the chunk of code starting with the do loop done by individual processors (which is what I want)?
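As a side note on measurement, a common pattern (sketched here as an addition to the program above; elapsed and tmax are illustrative names, not from the original code) is to synchronize all ranks with MPI_BARRIER before and after the timed region and report the maximum across ranks, so that MPI_WTIME reflects the slowest process rather than per-rank jitter:

real(kind=8) :: elapsed, tmax           ! illustrative additions
! ...
call MPI_BARRIER(MPI_COMM_WORLD, ierr)  ! line up all ranks before timing
start = MPI_WTIME()
! ... the work being timed (the coefficient loop and the call to thomas) ...
call MPI_BARRIER(MPI_COMM_WORLD, ierr)  ! wait for the slowest rank
finish = MPI_WTIME()
elapsed = finish - start
call MPI_REDUCE(elapsed, tmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, &
                MPI_COMM_WORLD, ierr)   ! worst-case time lands on rank 0
if (my_rank == 0) print *, 'max elapsed time (s): ', tmax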

Calculations on vectors become slower after better optimization flag and OpenMP

Consider the following Fortran code
program example
  implicit none
  integer, parameter :: ik = selected_int_kind(15)
  integer, parameter :: rk = selected_real_kind(15,307)
  integer(ik) :: N, i, j, pc, time_rate, start_time, end_time, M
  real(rk), allocatable :: K(:,:), desc(:,:)
  real(rk) :: kij, dij
  integer :: omp_get_num_threads, nth
  N = 2000
  M = 400
  allocate(K(N,N))
  allocate(desc(N,M))
  pc = 10
  do i = 1, N
    desc(i,:) = real(i,rk)
    if (i == int(N*pc)/100) then
      print *, "desc % complete: ", pc
      pc = pc + 10
    endif
  enddo
  call system_clock(start_time)
  !$OMP PARALLEL PRIVATE(nth)
  nth = omp_get_num_threads()
  print *, "omp threads", nth
  !$OMP END PARALLEL
  !$OMP PARALLEL DO &
  !$OMP DEFAULT(SHARED) &
  !$OMP PRIVATE(i,j,dij,kij)
  do i = 1, N
    do j = i, N
      dij = sum(abs(desc(i,:) - desc(j,:)))
      kij = dexp(-dij)
      K(i,j) = kij
      K(j,i) = kij
    enddo
    K(i,i) = K(i,i) + 0.1
  enddo
  !$OMP END PARALLEL DO
  call system_clock(end_time, time_rate)
  print *, "Time taken for Matrix:", real(end_time - start_time, rk)/real(time_rate, rk)
end program example
I compiled it using gfortran-6 on MacOS X 10.11 using the following flags:
gfortran example.f90 -fopenmp -O0
gfortran example.f90 -fopenmp -O3
gfortran example.f90 -fopenmp -mtune=native
following which I ran it with one and two threads using the OMP_NUM_THREADS environment variable. I can see that it is utilizing two cores. However, the O3 flag, which should enable vectorization, does not help performance at all; if anything, it degrades it a bit. Timings are given below in seconds (averaged over 10 runs):
|Opt \ Threads |   1    |   2   |
|--------------|--------|-------|
|-O0           | 10.962 | 9.183 |
|-O3           | 11.581 | 9.250 |
|-mtune=native | 11.211 | 9.084 |
What is wrong with my program?
First of all, if you want good performance from -O3, you have to give it something that can actually be optimised. The bulk of the work happens in the sum intrinsic, which already operates on a vectorised array expression, so it doesn't get any faster when you switch from -O0 to -O3.
Also, if you want better performance, transpose desc: Fortran stores arrays column-major, so desc(i,:) is non-contiguous in memory while desc(:,i) is contiguous.
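A minimal sketch of the same kernel with the descriptors stored transposed (descT(M,N) is a hypothetical replacement for desc(N,M), not part of the original code), so that every slice in the sum is a contiguous column:

real(rk), allocatable :: descT(:,:)   ! hypothetical transposed storage
allocate(descT(M,N))
do i = 1, N
  descT(:,i) = real(i,rk)             ! column i is contiguous
enddo
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,dij,kij)
do i = 1, N
  do j = i, N
    dij = sum(abs(descT(:,i) - descT(:,j)))   ! unit-stride accesses
    kij = exp(-dij)
    K(i,j) = kij
    K(j,i) = kij
  enddo
  K(i,i) = K(i,i) + 0.1
enddo
!$OMP END PARALLEL DO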

Unclassifiable OpenMP directive in a Fortran program

I was trying to parallelize a code in Fortran using OpenMP, with this code:
program pigreco
  !----------------------------------------!
  use OMP_LIB
  implicit none
  !----------------------------------------!
  integer :: i
  integer, parameter :: N = 100000
  integer, parameter :: NCPU = 4
  real*8 :: t0, t1
  real :: h, totale, x, f
  !----------------------------------------!
  print '(a,2x,i15)', ' Number of intervals: ', N
  totale = 0.0
  h = 1. / N
  call OMP_SET_NUM_THREADS(NCPU)
  write(*, '(a,i10)') 'Total number of processors: ', NCPU
  t0 = OMP_GET_WTIME()
  !----------------------------------------!
#ifdef PARALLEL
  !
  print '(a)', "Parallel version selected."
  !
  !$OMP PARALLEL DO PRIVATE(x, f) REDUCTION(+:totale)
  !
  do i = 1, N
    x = (i - 0.5) * h
    f = (4 * h) / (1 + x**2)
    totale = totale + f
  enddo
  !$OMP END PARALLEL DO
  !
#endif
  !
  t1 = OMP_GET_WTIME()
  !
  PRINT '(a,2x,f30.25)', ' Computed PI =', totale
  PRINT '(a,2x,f30.25)', ' Total computational time =', t1 - t0
  !
end program pigreco
When I then try to compile with the line gfortran prova.F90 -fopenmp -D PARALLEL, it gives me an error saying "unclassifiable OpenMP directive at (1)".
The problem is that you defined PARALLEL as a preprocessor macro (with the default value 1), so instead of reading !$OMP PARALLEL DO the compiler reads !$OMP 1 DO, which of course doesn't make sense. Change #ifdef PARALLEL to #ifdef RUNPARALLEL and -DPARALLEL to -DRUNPARALLEL, and the compiler gives no error.
Alternatively, you can use the fact that when compiling with OpenMP support the macro _OPENMP is defined automatically, so you can use #ifdef _OPENMP and no -D flag at all.
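A minimal sketch of that second approach (only the directive guards change; the loop body is as in the question):

#ifdef _OPENMP
!$OMP PARALLEL DO PRIVATE(x, f) REDUCTION(+:totale)
#endif
do i = 1, N
  x = (i - 0.5) * h
  f = (4 * h) / (1 + x**2)
  totale = totale + f
enddo
#ifdef _OPENMP
!$OMP END PARALLEL DO
#endif

Compiled with gfortran prova.F90 -fopenmp, both the macro and the directives are active; without -fopenmp the same source simply runs the serial loop.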

How to efficiently calculate matrix inner product in Fortran?

I am trying to calculate something similar to a weighted matrix inner product in Fortran. The current script that I am using for calculating the inner product is as follows
! --> In
real(kind=8), intent(in), dimension(ni,nj,nk,nVar) :: U1, U2
real(kind=8), intent(in), dimension(ni,nj,nk) :: intW
! --> Out
real(kind=8), intent(out) :: innerProd
! --> Local
integer :: ni, nj, nk, nVar, iVar
! --> Computing inner product
innerProd = 0.0d0   ! initialize the intent(out) accumulator
do iVar = 1, nVar
  innerProd = innerProd + sum(U1(:,:,:,iVar)*U2(:,:,:,iVar)*intW)
enddo
But I found that the above script that I am currently using is not very efficient. The same operation can be performed in Python using NumPy as follows,
import os
# --> Preventing numpy from multi-threading (must be set before importing numpy)
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
import numpy as np

ni = nj = nk = nVar = 130
innerProd = 0
# --> Toy matrices
U1 = np.random.random((ni, nj, nk, nVar))
U2 = np.random.random((ni, nj, nk, nVar))
intW = np.random.random((ni, nj, nk))
# --> Reshaping
U1 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
U2 = np.reshape(np.ravel(U2), (ni*nj*nk, nVar))
intW = np.reshape(np.ravel(intW), (ni*nj*nk))
# --> Calculating inner product
for iVar in range(nVar):
    innerProd = innerProd + np.dot(U1[:, iVar], U2[:, iVar]*intW)
The second method using NumPy seems to be far faster than the Fortran version. For the specific case of ni = nj = nk = nVar = 130, the times taken by the two methods are as follows:
fortran_time = 25.8641 s
numpy_time = 6.8924 s
I tried improving my Fortran code with ddot from BLAS as follows,
do iVar = 1, nVar
  do k = 1, nk
    do j = 1, nj
      innerProd = innerProd + ddot(ni, U1(:,j,k,iVar), 1, U2(:,j,k,iVar)*intW(:,j,k), 1)
    enddo
  enddo
enddo
But there was no considerable improvement in time: the above method takes ~24 s for the same case of ni = nj = nk = nVar = 130. (I forgot to mention that I compiled the Fortran code with the '-O2' flag for performance.)
Unfortunately, there is no BLAS function for element-wise matrix multiplication. And I don't want to use reshape in Fortran because, unlike in Python, reshaping in Fortran copies my array to a new array, leading to more RAM usage.
Is there any way to speed up the performance in Fortran so as to get close to the performance of NumPy?
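On the reshape point, Fortran 2008 pointer bounds remapping can give a flat two-dimensional view without any copy, provided the arrays are declared with the TARGET attribute; a minimal sketch (U1flat and U2flat are illustrative names, not from the question):

real(kind=8), pointer, contiguous :: U1flat(:,:), U2flat(:,:)
! Requires U1 and U2 to carry the TARGET attribute; no data is copied,
! the pointers are just rank-2 views of the same storage.
U1flat(1:ni*nj*nk, 1:nVar) => U1
U2flat(1:ni*nj*nk, 1:nVar) => U2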
You may not be timing what you think you are timing. Here's a complete Fortran example:
program test
  use iso_fortran_env, r8 => real64
  implicit none
  integer, parameter :: ni = 130, nj = 130, nk = 130, nvar = 130
  real(r8), allocatable :: u1(:,:,:,:), u2(:,:,:,:), w(:,:,:)
  real(r8) :: sum, t0, t1
  integer :: i, j, k, n
  call cpu_time(t0)
  allocate(u1(ni,nj,nk,nvar))
  allocate(u2(ni,nj,nk,nvar))
  allocate(w(ni,nj,nk))
  call cpu_time(t1)
  write(*,'("allocation time(s):",es15.5)') t1-t0
  call cpu_time(t0)
  call random_seed()
  call random_number(u1)
  call random_number(u2)
  call random_number(w)
  call cpu_time(t1)
  write(*,'("random init time (s):",es15.5)') t1-t0
  sum = 0.0_r8
  call cpu_time(t0)
  do n = 1, nvar
    do k = 1, nk
      do j = 1, nj
        do i = 1, ni
          sum = sum + u1(i,j,k,n)*u2(i,j,k,n)*w(i,j,k)
        end do
      end do
    end do
  end do
  call cpu_time(t1)
  write(*,'("Sum:",es15.5," time(s):",es15.5)') sum, t1-t0
end program
And the output:
$ gfortran -O2 -o inner_product inner_product.f90
$ time ./inner_product
allocation time(s): 3.00000E-05
random init time (s): 5.73293E+00
Sum: 3.57050E+07 time(s): 5.69066E-01
real 0m6.465s
user 0m4.634s
sys 0m1.798s
Computing the inner product is less than 10% of the runtime in this Fortran code. How and what you are timing is very important. Are you sure you are timing the same things in the Fortran and Python versions? Are you sure you are only timing the inner-product calculation?
The following avoids making any copy (note that the BLAS ddot approach above still has to make a copy for the element-wise product):
subroutine dot3(n, a, b, c, result)
  implicit none
  real(kind=..) :: a(*), b(*), c(*), result
  integer :: i, n
  result = 0
  do i = 1, n
    result = result + a(i)*b(i)*c(i)
  enddo
end
dot3 is external, meaning not in a module/contains construct; that way sequence association lets you pass the start of each rank-4 section as a rank-1 array. The kind should obviously match the declaration in the main program.
in main code:
innerProd = 0
do iVar = 1, nVar
  call dot3(ni*nj*nk, U1(1,1,1,iVar), U2(1,1,1,iVar), intW, result)
  innerProd = innerProd + result
enddo
I had the same observation comparing NumPy and Fortran code.
The difference turns out to be the version of BLAS: I found that using DGEMM from netlib is similar to looping, and about three times slower than OpenBLAS (see the profiles in this answer).
The most surprising thing for me was that OpenBLAS provides code which is so much faster than just compiling a triple nested Fortran loop. It seems this is the whole point of GotoBLAS, which was handwritten in assembly code for the processor architecture.
Even timing the right thing, ordering the loops correctly, avoiding copies and using every optimising flag (in gfortran), the performance is still about three times slower than OpenBLAS. I've not tried ifort or pgi, but I wonder if this explains the upvoted comment by @kvantour ("loop finishes in 0.6s for me"); note that the intrinsic matmul is replaced by BLAS calls in some implementations.
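For instance, gfortran can route the intrinsic matmul through an external BLAS library above a size threshold; a sketch of such a compile line (the flags are gfortran-specific and worth checking against your version):

gfortran -O2 -fexternal-blas -fblas-matmul-limit=32 example.f90 -lopenblas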

OpenBLAS slower than intrinsic function dot_product

I need to compute a dot product in Fortran. I can use the intrinsic function dot_product or ddot from OpenBLAS. The problem is that ddot is slower. This is my code:
With BLAS:
program VectorBLAS
  ! time VectorBlas.e = 0.30s
  implicit none
  double precision, dimension(3) :: b
  double precision :: result
  double precision, external :: ddot
  integer, parameter :: LargeInt_K = selected_int_kind(18)
  integer(kind=LargeInt_K) :: I
  DO I = 1, 10000000
    b(:) = 3
    result = ddot(3, b, 1, b, 1)
  END DO
end program VectorBLAS
With dot_product
program VectorModule
  ! time VectorModule.e = 0.19s
  implicit none
  double precision, dimension(3) :: b
  double precision :: result
  integer, parameter :: LargeInt_K = selected_int_kind(18)
  integer(kind=LargeInt_K) :: I
  DO I = 1, 10000000
    b(:) = 3
    result = dot_product(b, b)
  END DO
end program VectorModule
The two codes are compiled using:
gfortran file_name.f90 -lblas -o file_name.e
What am I doing wrong? Shouldn't BLAS be faster?
While BLAS routines, and especially the optimized versions, are generally faster for larger arrays, the built-in functions are faster for small sizes.
This is especially visible from the linked source code of ddot, where additional work is spent on further functionality (e.g., support for different increments). For small array lengths, this extra work outweighs the performance gain from the optimizations.
If you make your vectors (much) larger, the optimized version should be faster.
Here is an example to illustrate this:
program test
  use, intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  integer :: t1, t2, rate, ttot1, ttot2, i
  real(REAL64), allocatable :: a(:), b(:), c(:)
  real(REAL64), external :: ddot
  allocate( a(100000), b(100000), c(100000) )
  call system_clock(count_rate=rate)
  ttot1 = 0 ; ttot2 = 0
  do i = 1, 1000
    call random_number(a)
    call random_number(b)
    call system_clock(t1)
    c = dot_product(a,b)
    call system_clock(t2)
    ttot1 = ttot1 + t2 - t1
    call system_clock(t1)
    c = ddot(100000, a, 1, b, 1)
    call system_clock(t2)
    ttot2 = ttot2 + t2 - t1
  enddo
  print *, 'dot_product: ', real(ttot1)/real(rate)
  print *, 'BLAS, ddot:  ', real(ttot2)/real(rate)
end program
The BLAS routines are quite a bit faster here:
OMP_NUM_THREADS=1 ./a.out
dot_product: 0.145999998
BLAS, ddot: 0.100000001