How to parallelize the nested loop - fortran

A small example serial code, which has the same structure as my code, is shown below.
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: i, j
DOUBLE PRECISION :: en,ei,es
DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(i,j) = DBLE(i) + DBLE(j)
END DO
END DO
DO i = 1, 200, 1
en = 2.0d0/DBLE(200)*(i-1)-1.0d0
et(i) = en
es = 0.0d0
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
WRITE (UNIT=3, FMT=*) et(i), es
END DO
CLOSE(UNIT=3)
STOP
END PROGRAM MAIN
SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER :: i
DOUBLE PRECISION :: en, ei, gf,p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
gf = 1.0d0 / (en - kn(i) * p)
ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
I am running my code on a cluster with 32 CPUs per node and 250 GB of memory shared by the 32 CPUs of each node. I can use at most 32 nodes.
Each time the inner loop finishes, there is one value to collect, so after all outer-loop iterations are done there are 200 values in total. If the inner loop is executed by a single CPU, the run takes more than 3 days (more than 72 hours).
I want to parallelize both the inner loop and the outer loop. Could anyone suggest how to parallelize this code?
Can I use MPI for both the inner loop and the outer loop? If so, how do I distinguish the CPUs that execute the different loops (inner and outer)?
On the other hand, I have seen people mention hybrid MPI and OpenMP parallelization. Can I use MPI for the outer loop and OpenMP for the inner loop? If so, how do I collect one value each time the inner loop finishes, and all 200 values once the outer loop is done? How do I distinguish the CPUs that execute the inner loop from those that execute the outer loop?
Alternatively, does anyone have other suggestions for parallelizing the code and improving its efficiency? Thank you very much in advance.

As mentioned in the comments, a good answer would require a more detailed question. However, at first sight it seems that parallelizing the inner loop
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
should be enough to solve your problem, or at least it is a good start. First of all, I guess that there is an error in the loop
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(j,k) = DBLE(j) + DBLE(k)
END DO
END Do
since k is set to 0 and there is no element with index 0 (see your variable declaration). Also, ki is declared as a ki(1000,2000) array, while ki(j,i) would need a (2000,1000) array. Besides these errors, I guess that ki should be calculated as
ki(i,j) = DBLE(j) + DBLE(i)
If that is true, I suggest the following solution
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: i, j, k,icr,icr0,icr1
DOUBLE PRECISION :: en,ei,es,timerRate
DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
INTEGER,PARAMETER:: nthreads=1
call system_clock(count_rate=icr)
timerRate=real(icr)
call system_clock(icr0)
call omp_set_num_threads(nthreads)
OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(i,j) = DBLE(j) + DBLE(i)
END DO
END DO
DO i = 1, 200, 1
en = 2.0d0/DBLE(200)*(i-1)-1.0d0
et(i) = en
es = 0.0d0
!$OMP PARALLEL DO private(j,kn,ei) firstprivate(en) shared(ki) reduction(+:es)
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
!$OMP END PARALLEL DO
WRITE (UNIT=3, FMT=*) et(i), es
END DO
CLOSE(UNIT=3)
call system_clock(icr1)
write (*,*) (icr1-icr0)/timerRate ! return computing time
STOP
END PROGRAM MAIN
SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER :: i
DOUBLE PRECISION :: en, ei, gf,p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
gf = 1.0d0 / (en - kn(i) * p)
ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
I added some variables to measure the computing time ;-).
This solution runs in 5.14 s with nthreads=1 and in 2.75 s with nthreads=2. That does not quite halve the computing time, but it seems a good result for a first attempt. Unfortunately, this machine has a Core i3 processor, so I can't go beyond nthreads=2. I wonder how the code will behave with nthreads=16.
Please let me know.
I hope this helps.
Finally, note that the data-sharing attributes of the variables (private, firstprivate and shared) must be chosen carefully in the real code.
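Since the question also asks about the outer loop and about collecting the 200 results, here is a minimal sketch of the alternative: parallelizing the outer loop with OpenMP and storing each result in an array that is written out afterwards. It is only an illustration under stated assumptions, not a tuned solution. The names outer_loop_sketch, kt and es_all are hypothetical; kt is stored transposed relative to ki so that each slice passed to cal is contiguous, and cal is turned into an internal procedure taking an assumed-shape array.

```fortran
program outer_loop_sketch
  implicit none
  integer, parameter :: ne = 200, nrow = 1000, ncol = 2000
  integer :: i, j, funit
  double precision :: en, ei, es
  double precision, allocatable :: kt(:,:)      ! hypothetical: transpose of ki
  double precision :: et(ne), es_all(ne)        ! es_all is a hypothetical result buffer

  ! kt(i,j) plays the role of ki(j,i) in the question
  allocate(kt(ncol, nrow))
  do j = 1, nrow
    do i = 1, ncol
      kt(i, j) = dble(i) + dble(j)
    end do
  end do

  ! Each outer iteration is independent, so no reduction clause is needed:
  ! every thread accumulates into its own private es and stores it in es_all(i).
  !$omp parallel do private(j, en, ei, es) shared(kt, et, es_all)
  do i = 1, ne
    en = 2.0d0/dble(ne)*(i-1) - 1.0d0
    es = 0.0d0
    do j = 1, nrow
      call cal(en, kt(:, j), ei)
      es = es + ei
    end do
    et(i) = en
    es_all(i) = es
  end do
  !$omp end parallel do

  ! Output is done serially after the parallel loop, so the file stays ordered.
  open(newunit=funit, file='output.dat', status='unknown')
  do i = 1, ne
    write(funit, *) et(i), es_all(i)
  end do
  close(funit)

contains

  subroutine cal(en, kn, ei)
    double precision, intent(in)  :: en
    double precision, intent(in)  :: kn(:)
    double precision, intent(out) :: ei
    double precision :: p
    integer :: k
    p = 3.14d0
    ei = 0.0d0
    do k = 1, size(kn)
      ei = ei + 1.0d0 / (en - kn(k) * p)
    end do
  end subroutine cal

end program outer_loop_sketch
```

Built with an OpenMP flag (for example -fopenmp with gfortran) the 200 outer iterations are shared among the threads; without the flag the directives are treated as comments and the program runs serially. With only 200 outer iterations this approach cannot use more than 200 threads, so whether it beats the inner-loop version depends on the real problem sizes.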

Related

if statement to determine steady-state

My code below correctly solves a 1D heat equation for a function u(x,t). I now want to find the steady-state solution, the solution that no longer changes in time, so it should satisfy u(t+1)-u(t) = 0. What is the most efficient way to find the steady-state solution? I show three different attempts below, but I'm not sure whether any of them actually does what I want. The first and third have correct syntax; the second has a syntax error due to the if statement. The methods differ only in the structure of the if test.
Method 1 :
program parabolic1
integer, parameter :: n = 10, m = 20
real, parameter :: h = 0.1, k = 0.005 !step sizes
real, dimension (0:n) :: u,v
integer:: i,j
real::pi,pi2
u(0) = 0.0; v(0) = 0.0; u(n) = 0.0; v(n) =0.0
pi = 4.0*atan(1.0)
pi2 = pi*pi
do i=1, n-1
u(i) = sin( pi*real(i)*h)
end do
do j = 1,m
do i = 1, n-1
v(i) = 0.5*(u(i-1)+u(i+1))
end do
t = real(j)*k !increment in time, now check for steady-state
!steady-state check: this checks the solutions at every space point which I don't think is correct.
do i = 1,n-1
if ( v(i) - u(i) .LT. 1.0e-7 ) then
print*, 'steady-state condition reached'
exit
end if
end do
do i = 1, n-1 !updating solution
u(i) = v(i)
end do
end do
end program parabolic1
Method 2 :
program parabolic1
integer, parameter :: n = 10, m = 20
real, parameter :: h = 0.1, k = 0.005 !step sizes
real, dimension (0:n) :: u,v
integer:: i,j
real::pi,pi2
u(0) = 0.0; v(0) = 0.0; u(n) = 0.0; v(n) =0.0
pi = 4.0*atan(1.0)
pi2 = pi*pi
do i=1, n-1
u(i) = sin( pi*real(i)*h)
end do
do j = 1,m
do i = 1, n-1
v(i) = 0.5*(u(i-1)+u(i+1))
end do
t = real(j)*k !increment in time, now check for steady-state
!steady-state check: (This gives an error message since the if statement doesn't have a logical scalar expression, but I want to compare the full arrays v and u as shown.
if ( v - u .LT. 1.0e-7 ) then
print*, 'steady-state condition reached'
exit
end if
do i = 1, n-1 !updating solution
u(i) = v(i)
end do
end do
end program parabolic1
Method 3 :
program parabolic1
integer, parameter :: n = 10, m = 20
real, parameter :: h = 0.1, k = 0.005 !step sizes
real, dimension (0:n) :: u,v
integer:: i,j
real::pi,pi2
u(0) = 0.0; v(0) = 0.0; u(n) = 0.0; v(n) =0.0
pi = 4.0*atan(1.0)
pi2 = pi*pi
do i=1, n-1
u(i) = sin( pi*real(i)*h)
end do
do j = 1,m
do i = 1, n-1
v(i) = 0.5*(u(i-1)+u(i+1))
end do
t = real(j)*k !increment in time, now check for steady-state
!steady-state check: Perhaps this is the correct expression I want to use
if( norm2(v) - norm2(u) .LT. 1.0e-7 ) then
print*, 'steady-state condition reached'
exit
end if
do i = 1, n-1 !updating solution
u(i) = v(i)
end do
end do
end program parabolic1
Without discussing which method to determine "closeness" is best or correct (not really being a programming problem) we can focus on what the Fortran parts of the methods are doing.
Method 1 and Method 2 are similar ideas (but broken in their execution), while Method 3 is different (and broken in another way).
Note also that in general one wants to compare the magnitude of the difference abs(v-u) rather than the (signed) difference v-u. With non-monotonic changes over iterations these are quite different.
Method 3 uses norm2(v) - norm2(u) to test whether the arrays u and v are similar. This isn't correct. Consider
norm2([1.,0.])-norm2([0.,1.])
instead of the more correct
norm2([1.,0.]-[0.,1.])
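To make the difference concrete, here is a small sketch (the program and variable names are made up for illustration) that evaluates both expressions for these two vectors. The first prints zero even though the vectors are clearly different, while the second prints sqrt(2).

```fortran
program norm_compare
  implicit none
  real :: a(2), b(2)

  a = [1.0, 0.0]
  b = [0.0, 1.0]

  ! Difference of the norms: both vectors have length 1, so this is 0.
  print *, 'norm2(a) - norm2(b) =', norm2(a) - norm2(b)

  ! Norm of the difference: a - b = [1, -1], which has length sqrt(2).
  print *, 'norm2(a - b)        =', norm2(a - b)
end program norm_compare
```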
Method 2's
if ( v - u .LT. 1.0e-7 ) then
has the problem of being an invalid array expression, but the "are all points close?" can be written appropriately as
if ( ALL( v - u .LT. 1.0e-7 )) then
(You'll find other questions around here about such array reductions).
Method 1 tries something similar, but incorrectly:
do i = 1,n-1
if ( v(i) - u(i) .LT. 1.0e-7 ) then
print*, 'steady-state condition reached'
exit
end if
end do
This is incorrect in one big way, and one subtle way.
First, the loop is exited when the condition tests true the first time, with a message saying the steady state is reached. This is incorrect: you need all values close, while this is testing for any value close.
Second, when the condition is met, you exit. But you don't exit the time iteration loop, you exit the closeness testing loop (exit without a construct name leaves the innermost do construct). Execution then continues immediately after that innermost construct, exactly as it would if the condition were never met; the only difference is that the message is printed. To leave the time loop you need to use a construct name on it.
I won't show how to do that (again there are other questions here about that), because you also need to fix the test condition, by which point you'll be better off using if(all(... (corrected Method 2) without that additional do construct.
For Methods 1 and 2 you'll have something like:
if (all(v-u .lt. 1e-7)) then
print *, "Converged"
exit
end if
And for Method 3:
if (norm2(v-u) .lt. 1e-7) then
print *, "Converged"
exit
end if
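Putting these pieces together, a complete version of the all(...) test could look like the sketch below. It is only an illustration: it uses abs(v-u) as suggested earlier, the step cap max_steps = 1000 and the names steady_state and converged are assumptions of mine (the original m = 20 is likely too small to reach a tolerance of 1e-7), and the time loop is named only to make the target of exit explicit.

```fortran
program steady_state
  implicit none
  integer, parameter :: n = 10, max_steps = 1000   ! max_steps is an assumed cap
  real, parameter :: h = 0.1, tol = 1.0e-7
  real :: u(0:n), v(0:n), pi
  integer :: i, j
  logical :: converged

  pi = 4.0*atan(1.0)
  u = 0.0
  v = 0.0
  do i = 1, n-1
    u(i) = sin(pi*real(i)*h)
  end do

  converged = .false.
  time_loop: do j = 1, max_steps
    ! Same update as in the question, written with array sections
    v(1:n-1) = 0.5*(u(0:n-2) + u(2:n))

    ! Converged only if every point changed by less than tol in magnitude
    if (all(abs(v - u) < tol)) then
      converged = .true.
      exit time_loop
    end if

    u = v   ! boundary values stay zero because v(0) = v(n) = 0
  end do time_loop

  if (converged) then
    print *, 'steady-state condition reached after', j, 'steps'
  else
    print *, 'no steady state within', max_steps, 'steps'
  end if
end program steady_state
```

Because the test is a single scalar if inside the time loop, a plain exit would also leave the time loop here; the construct name only matters when the exit sits inside a nested do.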

Intensity using bessel function tending to infinity

In short, my code is supposed to write a text file with my intensity values; instead, for every r except 0, the intensity comes out as +infinity. I don't know where my mistake is. This exercise is meant to practice integration with Simpson's 1/3 rule. All the Bessel Jx text files are fine; the only problem lies in my intensity file. The first code block creates and fills the Bessel Jx values in text files. The second part creates and fills the intensity values using a Bessel function (this is where I suspect the error is, but I'm not sure). The third and fourth blocks are the Simpson 1/3 method and my Bessel function, respectively.
program intensidade
implicit none
real,parameter::pi=acos(-1.),lambda=500e-9
real::k,r,kr,intensidade1
real,external::bessel,simpson13
real::i
integer::j
open(0,file='besselj0.txt')
open(1,file='besselj1.txt')
open(2,file='besselj2.txt')
open(3,file='intensidade.txt')
do j=0,2
i=0
do while (i<=20)
write(j,*)i,simpson13(bessel,j,i,0.,pi)
i=i+1
enddo
enddo
close(0);
close(1);
close(2);
r=0
k=2*pi/lambda
kr=k*r*10e-6
do while (r<=1)
if(r==0) then
write(3,*)r,(1/2)**2
else
write(3,*)r,(simpson13(bessel,1,kr,0.,pi)/kr)**2
endif
r=r+0.1
enddo
close(3)
pause
end program intensidade
real function simpson13(funcao,m,x,a,b)
implicit none
real,external::funcao
real,intent(in)::a,b,x
integer,intent(in)::m
integer::i
real::h
h=(b-a)/1000
simpson13=funcao(m,x,a)-funcao(m,x,b)
do i=1,499
simpson13=simpson13+4*funcao(m,x,a+h*(2*i-1))+2*funcao(m,x,a+2*h*i)
enddo
simpson13=(h/3)*simpson13
end function simpson13
real function bessel(m,x,teta)
implicit none
real,parameter::pi=acos(-1.)
real,intent(in)::x,teta
integer,intent(in)::m
bessel=cos(m*teta-x*sin(teta))/pi
end function bessel
The main error is that kr is not recomputed in each loop iteration: it is set once while r is 0, so kr stays 0 and the division by kr produces infinity.
Other improvements
style: align and indent your code
use file units provided by the system, i.e. open (newunit=...)
remove the pause command
Furthermore, your line write(..)r,(1/2)**2 uses integer arithmetic, so 1/2 yields zero and (1/2)**2 is zero as well.
The following is a possible way to rewrite your program
program intensidade
implicit none
real, parameter :: pi=acos(-1.0), lambda=500e-9
real :: k, r, kr
real, external :: bessel, simpson13
integer :: ir, funit, ix, m
character(128) :: fname
do m = 0, 2
write (fname, "(A7, I1, A4)") 'besselj', m, '.txt'
open (newunit=funit, file=fname)
do ix = 0, 20
write (funit, *) ix, simpson13(bessel, m, real(ix), 0.0, pi)
end do
close (funit)
end do
open (newunit=funit, file='intensidade.txt')
r = 0
k = 2*pi/lambda
write (funit, *) r, (0.5)**2
do ir = 1, 10
r = ir/10.0
kr = k*r*10e-6
write (funit, *) r, (simpson13(bessel, 1, kr, 0.0, pi)/kr)**2
end do
close (funit)
end program
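As a side note, the simpson13 posted in the question subtracts the value at b instead of adding it, and its loop up to 499 never reaches the last odd node, so it may be worth checking it against the textbook rule. The sketch below is a minimal version of the standard composite Simpson 1/3 rule, tested on an integral with a known value; the names simpson_check, simpson and f are hypothetical and the integrand is just a test function, not the Bessel integrand. It also prints the integer-division pitfall mentioned above.

```fortran
program simpson_check
  implicit none
  real, parameter :: pi = acos(-1.0)

  ! The exact value of the integral of sin(t) over [0, pi] is 2.
  print *, 'integral of sin on [0,pi]:', simpson(0.0, pi, 1000)

  ! Integer versus real arithmetic
  print *, '(1/2)**2     =', (1/2)**2        ! integer division: 1/2 is 0
  print *, '(1.0/2.0)**2 =', (1.0/2.0)**2    ! real division: 0.25

contains

  ! Test integrand (an assumption for this sketch)
  real function f(t)
    real, intent(in) :: t
    f = sin(t)
  end function f

  ! Composite Simpson 1/3 rule with n subintervals (n is assumed even)
  real function simpson(a, b, n)
    real, intent(in) :: a, b
    integer, intent(in) :: n
    real :: h
    integer :: i

    h = (b - a)/n
    simpson = f(a) + f(b)              ! both endpoints enter with weight 1
    do i = 1, n-1, 2                   ! odd interior nodes, weight 4
      simpson = simpson + 4.0*f(a + i*h)
    end do
    do i = 2, n-2, 2                   ! even interior nodes, weight 2
      simpson = simpson + 2.0*f(a + i*h)
    end do
    simpson = simpson*h/3.0
  end function simpson

end program simpson_check
```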

Openmp parallel do loop working correctly ~50% of the time

I am currently working on adding OpenMP parallelization to a do loop in one of the codes I have written for research. I am fairly new to OpenMP, so I would appreciate any suggestions about what might be going wrong.
Basically, I have added a parallel do loop to the following code (which works prior to parallelization). r(:,:,:,:) is an array holding a very large number of molecular coordinates, indexed by time, molecule, atom, and (xyz). It is about 100 GB of data (I am working on an HPC system with plenty of RAM). I am trying to parallelize the outer loop and divide it between processors to reduce the time this calculation takes. I thought it would be a good candidate because msd and msd_cm are the only things that need to be written by multiple processors and stored for later, and since each iteration gets its own element of these arrays there should be no race condition.
The problem: if I run this code 5 times I get varying results. Sometimes msd is calculated correctly (or appears to be), and sometimes it is all zeros when I later average it. Without parallelization there are no issues.
I have tried changing which variables are shared and which are private, and I think I have accounted for everything. The i index of the msd and msd_cm arrays should never be the same in two threads, so I would not expect them to be an issue.
! Loop over time origins
counti = 0
ind = 0
!$OMP PARALLEL DO schedule(static) PRIVATE(i,j,k,it,r_old,r_cm_old,shift,shift_cm,dsq,ind) &
!$OMP& SHARED(msd,msd_cm)
do i=1, nconfigs-nt, or_int
if(MOD(counti*or_int,500) == 0) then
write(*,*) 'Reached the ', counti*or_int,'th time origin'
end if
! Set the Old Coordinates
counti = counti + 1
ind = (i-1)/or_int + 1
r_old(:,:,:) = r(i,:,:,:)
r_cm_old(:,:) = r_cm(i,:,:)
shift = 0.0
shift_cm = 0.0
! Loop over the timesteps in each trajectory
do it=i+2, nt+i
! Loop over molecules
do j = 1, nmols
do k=1, atms_per_mol
! Calculate the shift if it occurs.
shift(j,k,:) = shift(j,k,:) - L(:)*anint((r(it,j,k,:) - &
r_old(j,k,:) )/L(:))
! Calculate the square displacements
dsq = ( r(it,j,k,1) + shift(j,k,1) - r(i,j,k,1) ) ** 2. &
+( r(it,j,k,2) + shift(j,k,2) - r(i,j,k,2) ) ** 2. &
+( r(it,j,k,3) + shift(j,k,3) - r(i,j,k,3) ) ** 2.
msd(ind, it-1-i, k) = msd(ind, it-1-i, k) + dsq
! Calculate the contribution to the c1,c2
enddo ! End Atoms Loop (k)
! Calculate the shift if it occurs.
shift_cm(j,:) = shift_cm(j,:) - L(:)*anint((r_cm(it,j,:) - &
r_cm_old(j,:) )/L(:))
! Calculate the square displacements
dsq = ( r_cm(it,j,1) + shift_cm(j,1) - r_cm(i,j,1) ) ** 2. &
+( r_cm(it,j,2) + shift_cm(j,2) - r_cm(i,j,2) ) ** 2. &
+( r_cm(it,j,3) + shift_cm(j,3) - r_cm(i,j,3) ) ** 2.
msd_cm(ind,it-1-i) = msd_cm(ind, it-1-i) + dsq
enddo ! End Molecules Loop (j)
r_old(:,:,:) = r(it,:,:,:)
r_cm_old(:,:) = r_cm(it,:,:)
enddo ! End t's loop (it)
enddo
!$OMP END PARALLEL DO
When this code is run and I later print the averaged msd results, they come out either correct or all zero; it is always one or the other. Do you see an issue that might explain why it only works part of the time? I am brand new to OpenMP, so it is entirely possible that I am doing something incredibly stupid.
Thanks in advance!

How to efficiently calculate matrix inner product in Fortran?

I am trying to calculate something similar to a weighted matrix inner product in Fortran. The current script that I am using for calculating the inner product is as follows
! --> In
real(kind=8), intent(in), dimension(ni, nj, nk, nVar) :: U1, U2
real(kind=8), intent(in), dimension(ni, nj, nk) :: intW
! --> Out
real(kind=8), intent(out) :: innerProd
! --> Local
integer :: ni, nj, nk, nVar, iVar
! --> Computing inner product
do iVar = 1, nVar
innerProd = innerProd + sum(U1(:,:,:,iVar)*U2(:,:,:,iVar)*intW)
enddo
But I found that the above script that I am currently using is not very efficient. The same operation can be performed in Python using NumPy as follows,
import numpy as np
import os
# --> Preventing numpy from multi-threading
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
innerProd = 0
# --> Toy matrices
U1 = np.random.random((ni,nj,nk,nVar))
U2 = np.random.random((ni,nj,nk,nVar))
intW = np.random.random((ni,nj,nk))
# --> Reshaping
U1 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
U2 = np.reshape(np.ravel(U2), (ni*nj*nk, nVar))
intW = np.reshape(np.ravel(intW), (ni*nj*nk))
# --> Calculating inner product
for iVar in range(nVar):
innerProd = innerProd + np.dot(U1[:, iVar], U2[:, iVar]*intW)
The second method, using NumPy, seems to be far faster than the Fortran one. For the specific case of ni = nj = nk = nVar = 130, the times taken by the two methods are as follows
fortran_time = 25.8641 s
numpy_time = 6.8924 s
I tried improving my Fortran code with ddot from BLAS as follows,
do iVar = 1, nVar
do k = 1, nk
do j = 1, nj
innerProd = innerProd + ddot(ni, U1(:,j,k,iVar), 1, U2(:,j,k,iVar)*intW(:,j,k), 1)
enddo
enddo
enddo
But there was no considerable improvement in time. The time taken by the above method for the case of ni = nj = nk = nVar = 130 is ~24s. (I forgot to mention that I compiled the Fortran code with '-O2' option for optimizing the performance).
Unfortunately, there is no BLAS function for element-wise matrix multiplication in Fortran. And I don't want to use reshape in Fortran because, unlike in Python, reshaping in Fortran copies my array to a new array, which increases RAM usage.
Is there any way to speed up the performance in Fortran so as to get close to the performance of Numpy?
You may not be timing what you think you are timing. Here's a complete Fortran example
program test
use iso_fortran_env, r8 => real64
implicit none
integer, parameter :: ni = 130, nj = 130, nk = 130, nvar = 130
real(r8), allocatable :: u1(:,:,:,:), u2(:,:,:,:), w(:,:,:)
real(r8) :: sum, t0, t1
integer :: i,j,k,n
call cpu_time(t0)
allocate(u1(ni,nj,nk,nvar))
allocate(u2(ni,nj,nk,nvar))
allocate(w(ni,nj,nk))
call cpu_time(t1)
write(*,'("allocation time(s):",es15.5)') t1-t0
call cpu_time(t0)
call random_seed()
call random_number(u1)
call random_number(u2)
call random_number(w)
call cpu_time(t1)
write(*,'("random init time (s):",es15.5)') t1-t0
sum = 0.0_r8
call cpu_time(t0)
do n = 1, nvar
do k = 1, nk
do j = 1, nj
do i = 1, ni
sum = sum + u1(i,j,k,n)*u2(i,j,k,n)*w(i,j,k)
end do
end do
end do
end do
call cpu_time(t1)
write(*,'("Sum:",es15.5," time(s):",es15.5)') sum, t1-t0
end program
And the output:
$ gfortran -O2 -o inner_product inner_product.f90
$ time ./inner_product
allocation time(s): 3.00000E-05
random init time (s): 5.73293E+00
Sum: 3.57050E+07 time(s): 5.69066E-01
real 0m6.465s
user 0m4.634s
sys 0m1.798s
Computing the inner product takes less than 10% of the runtime in this Fortran code. How and what you are timing is very important. Are you sure you are timing the same things in the Fortran and Python versions? Are you sure you are only timing the inner-product calculation?
This avoids making any copy. (note the blas ddot approach still needs to make a copy for the element-wise product)
subroutine dot3(n,a,b,c,result)
implicit none
real(kind=..) a(*),b(*),c(*),result
integer i,n
result=0
do i=1,n
result=result+a(i)*b(i)*c(i)
enddo
end
dot3 is external, meaning not in a module/contains construct. kind should obviously match main declaration.
in main code:
innerprod=0
do iVar = 1, nVar
call dot3(ni*nj*nk, U1(1,1,1,iVar),U2(1,1,1,iVar),intW,result)
innerProd=innerProd+result
enddo
I had the same observation comparing Numpy and Fortran code.
The difference turns out to be the version of BLAS. I found that using DGEMM from netlib is similar to looping and about three times slower than OpenBLAS (see profiles in this answer).
The most surprising thing for me was that OpenBLAS provides code which is so much faster than just compiling a Fortran triple nested loop. It seems this is the whole point of GotoBLAS, which was handwritten in assembly code for the processor architecture.
Even timing the right thing, ordering loops correctly, avoiding copies and using every optimising flag (in gfortran), the performance is still about three times slower than OpenBLAS. I've not tried ifort or pgi, but I wonder if this explains the upvoted comment by @kvantour "loop finishes in 0.6s for me" (note intrinsic matmul is replaced by BLAS in some implementations).

Reduction on array in FORTRAN

I'm trying to parallelize a module in my FORTRAN code using OpenMP and I'm running into some issues with threads overwriting updated values in the array. Obviously my first instinct was to do a reduction, but I'm not really sure how to go about it in this context. I've only done it in a simple x = x + update kind of situation, whereas here the updates are similar but hit the array elements in no particular order.
subroutine chargeInterp(q,x,xmin,xmax,dg,np,ng)
real(kind = 8) :: charge, dg, xmin, weight, xmax,wp
integer :: g1,g2,g1temp,g2temp,i,np,ng
real(kind = 8), dimension(np) :: q,x
!$OMP PARALLEL DO PRIVATE(g1,g2) REDUCTION(+:q)
do i=1,np
g1 = floor((x(i)-xmin)/dg)
g2 = g1 + 1
wp=((x(i)-xmin)/dg-g1)
weight=1-wp
q(g1+1) = q(g1+1) - weight
q(g2+1) = q(g2+1) - wp
enddo
!$OMP END PARALLEL DO
Just to give a rundown of what it's doing, essentially it's taking the position of a particle and weighting its charge onto adjacent grid points on the mesh.
Thanks for the help!
P.S. The omp statements wrapped around the loop don't work. Just throwing that one out there. Have also tried !$OMP ATOMIC before updating q. Compiles and runs, but my results don't match my un-parallelized results, so it's a no-go.
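For reference, here is a minimal, self-contained sketch of this kind of deposition with an OpenMP array reduction. It is an illustration under my own assumptions rather than a diagnosis of the posted routine: the program name, the sizes, the random positions and the temporary s are made up, q is sized ng+1 so that the index g1+2 always stays in bounds, and every per-particle temporary (g1, s, wp) is listed as private. In the posted loop wp and weight are not in the PRIVATE clause, so they are shared between threads by default, which is one thing worth checking.

```fortran
program charge_deposit
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: np = 100000, ng = 64     ! assumed particle and cell counts
  real(dp), parameter :: xmin = 0.0_dp, xmax = 1.0_dp
  real(dp) :: dg, s, wp
  real(dp) :: x(np), q(ng+1)                     ! ng cells have ng+1 grid points
  integer :: i, g1

  dg = (xmax - xmin)/ng
  call random_number(x)                          ! assumed positions in [0,1)
  q = 0.0_dp

  ! Each thread accumulates into its own copy of q; the copies are summed
  ! into the shared q at the end of the loop. All temporaries are private.
  !$omp parallel do private(g1, s, wp) reduction(+:q)
  do i = 1, np
    s  = (x(i) - xmin)/dg
    g1 = floor(s)                                ! index of the cell to the left (0-based)
    wp = s - g1                                  ! fractional distance from the left point
    q(g1+1) = q(g1+1) - (1.0_dp - wp)
    q(g1+2) = q(g1+2) - wp
  end do
  !$omp end parallel do

  ! Every particle deposits a total of -1, so the sum should be close to -np.
  print *, 'total deposited charge:', sum(q)
  print *, 'expected (approximately):', -real(np, dp)
end program charge_deposit
```

The result can differ from a serial run in the last few digits because the reduction changes the order of the floating-point additions, so comparing against un-parallelized results is best done with a tolerance.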