I would like to perform many independent operations (e.g. time integration of an ODE with different initial conditions) using MPI and Fortran. The initial conditions are stored, for example, in a $2 \times 1000$ array IC.
do i = 1, 1000
   (x0, y0) = (x(i), y(i))
   Solve an ODE with (x0, y0) for a time duration
   Save the result at the end of this duration
end do
Can anyone help with a minimal code using MPI, or a link to something similar?
I have already used OpenMP, but I think with MPI I would have access to more CPUs.
If the operations are truly independent (and the number of cases is a multiple of the number of processors) then:
call mpi_scatter to distribute the start points from root
let each process do its own work
call mpi_gather to collect the results back on root
root can then write to file.
If the number of processors doesn't divide evenly into the number of cases, you can use mpi_scatterv and mpi_gatherv instead (a sketch is given after the example below).
Example (rather trivial work per job, rather than solving ODEs):
program main
use iso_fortran_env
use mpi
implicit none
integer stat(mpi_status_size), tag, ierr
integer size, rank
integer, parameter :: N = 256 * 1000 ! assumes this is a multiple of the number of processors
integer, parameter :: root = 0
integer myN
integer i
real(real64), allocatable :: Y(:), myY(:)
real(real64) start, finish
call mpi_init( ierr )
call mpi_comm_size( mpi_comm_world, size, ierr )
call mpi_comm_rank( mpi_comm_world, rank, ierr )
! Set initial values for full array, then start timing
if ( rank == root ) then
allocate( Y(N) )
Y = [ ( i + 0.0_real64, i = 1, N ) ]
start = gettime()
end if
! Root parcels out the work (i.e., distributes the starting points)
! Processor with rank r will look at indices 1+r*N/size to (r+1)*N/size
myN = N / size
allocate( myY(myN) )
call mpi_scatter( Y, myN, mpi_double_precision, &
myY, myN, mpi_double_precision, root, mpi_comm_world, ierr )
! Each processor does its own work
call myWork( myN, myY )
! Root gets its results back
call mpi_gather ( myY, myN, mpi_double_precision, &
Y, myN, mpi_double_precision, root, mpi_comm_world, ierr )
! Root concludes timing, then writes to file
if ( rank == root ) then
finish = gettime()
write( *, * ) "Time taken = ", finish - start
open( 10, file="output.txt" )
write( 10, "( i8, 1x, es11.4 )" ) ( i, Y(i), i = 1, N )
close( 10 )
deallocate( Y )
end if
deallocate( myY )
call mpi_finalize( ierr )
contains
subroutine myWork( N, Y )
integer , intent(in ) :: N
real(real64), intent(inout) :: Y(N)
integer i
do i = 1, 10000
Y = 2 * Y - Y ! silly example, just to use some flops
end do
end subroutine myWork
real(real64) function getTime()
integer t(8)
call date_and_time( values=t )
getTime = 3600 * t(5) + 60 * t(6) + t(7) + 0.001 * t(8)
end function getTime
end program main
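As mentioned above, if the number of cases is not a multiple of the number of processors you can use mpi_scatterv/mpi_gatherv, which take per-rank counts and displacements. Here is a minimal sketch (the as-even-as-possible split and the trivial stand-in for the per-case work are my own choices, not part of the question):
program scatterv_example
use iso_fortran_env
use mpi
implicit none
integer, parameter :: N = 1000      ! need NOT be a multiple of the number of processors
integer, parameter :: root = 0
integer nproc, rank, ierr, r, myN, i
integer, allocatable :: counts(:), displs(:)
real(real64), allocatable :: Y(:), myY(:)
call mpi_init( ierr )
call mpi_comm_size( mpi_comm_world, nproc, ierr )
call mpi_comm_rank( mpi_comm_world, rank, ierr )
! Every rank works out the counts and displacements; each rank needs its own count anyway
allocate( counts(0:nproc-1), displs(0:nproc-1) )
counts = N / nproc
counts(0:mod(N,nproc)-1) = counts(0:mod(N,nproc)-1) + 1  ! first mod(N,nproc) ranks take one extra case
displs(0) = 0
do r = 1, nproc - 1
   displs(r) = displs(r-1) + counts(r-1)
end do
myN = counts(rank)
allocate( myY(myN) )
! Root holds the full array of starting values; other ranks only need a zero-size placeholder
if ( rank == root ) then
   allocate( Y(N) )
   Y = [ ( real( i, real64 ), i = 1, N ) ]
else
   allocate( Y(0) )
end if
call mpi_scatterv( Y, counts, displs, mpi_double_precision, &
                   myY, myN, mpi_double_precision, root, mpi_comm_world, ierr )
myY = 2 * myY   ! stand-in for the real per-case work (e.g. an ODE solve)
call mpi_gatherv( myY, myN, mpi_double_precision, &
                  Y, counts, displs, mpi_double_precision, root, mpi_comm_world, ierr )
if ( rank == root ) write( *, * ) 'first and last results: ', Y(1), Y(N)
call mpi_finalize( ierr )
end program scatterv_example
As before, root ends up with the gathered results in Y and can write them to file.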
My matrix calculation is: C = C - A*B
Here C is a symmetric matrix, so I want to speed up this calculation by computing just the upper triangle and then copying each element to its opposite. I used OpenMP and see that my implementation is slower than the plain calculation of the entire matrix C.
I also see that the calculation C = C - A*B is slower than C = C + A*B.
My program is attached. Please advise me!
Program testspeed
implicit none
integer nstate,nmeas,i,j,l
integer(kind=8) :: tclock1, tclock2, clock_rate
real(kind=8) :: elapsed_time
double precision, allocatable, dimension(:,:):: B,C,A
nstate =20000
nmeas=10000
allocate (B(nmeas,nstate),C(nstate,nstate),A(nstate,nmeas))
A=1d0
B=1d0
call system_clock(tclock1)
write(*,*) "1"
!$omp parallel do
do j = 1, nstate
do l = 1,nmeas
do i = 1, j
C(j,i) = C(j,i) - A(j,l)*B(l,i)
C(i,j)=C(j,i)
end do
end do
end do
!$omp end parallel do
write(*,*) "2"
call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
write(*,*) elapsed_time
end Program testspeed
One of the basic rules I have taught my students is that nobody should be writing dense matrix multiplies themselves nowadays - and should not have been for 30+ years. You should use the BLAS library instead. Below I compare using the BLAS library against your loop ordering and a better loop ordering, and also against the Fortran intrinsic function matmul, which I use as a reference to check that the results are correct. BLAS and matmul don't take advantage of the symmetry of C, yet they are still the fastest routines - BLAS is about 200-300 times quicker than the loop ordering you have written. Note I have also cut the matrix size down somewhat, as I got bored waiting for the original to run for larger cases:
ijb#ijb-Latitude-5410:~/work/stack$ cat mm.f90
Program testspeed
Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
Use omp_lib, Only : omp_get_max_threads
Implicit None
Integer nstate,nmeas,i,j,l
Integer(li) :: tclock1, tclock2, clock_rate
Real(wp) :: elapsed_time
Real( wp ), Allocatable, Dimension(:,:):: B,C,A
Real( wp ), Allocatable, Dimension(:,:):: C_test
Real( wp ), Allocatable, Dimension(:,:):: C_start
Write( *, * ) 'Using ', omp_get_max_threads(), ' threads'
!!$ nstate =2000
!!$ nmeas=1000
nstate = 5000
nmeas = 2500
Allocate (B(nmeas,nstate),C(nstate,nstate),A(nstate,nmeas))
Allocate( C_test, Mold = C )
Allocate( C_start, Mold = C )
!!$ A=1.0_wp
!!$ B=1.0_wp
! Random numbers are a much better test
Call Random_number( A )
B = Transpose( A ) ! make sure result is symmetric
Call Random_number( C_start )
! Make Initial C Symmetric
C_start = 0.5_wp * ( C_start + Transpose( C_start ) )
Write( *, * ) 'Matix sizes ', Shape( A ), Shape( B ), Shape( C )
C_test = C_start
Call system_Clock(tclock1)
C_test = C_test - Matmul( A, B )
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write( *,'( a, t20, f8.3 )' ) 'Matmul', elapsed_time
C = C_start
Call system_Clock(tclock1)
!$omp parallel do
Do j = 1, nstate
Do l = 1,nmeas
Do i = 1, j
C(j,i) = C(j,i) - A(j,l)*B(l,i)
C(i,j)=C(j,i)
End Do
End Do
End Do
!$omp end parallel do
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write(*,'( a, t20, f8.3, t30, "Max error ", g20.12 )' ) &
'Orig loops', elapsed_time, Maxval( Abs( C_test - C ) )
C = C_start
Call system_Clock(tclock1)
!$omp parallel default( none ) shared ( nstate, nmeas, A, B, C ), private( i, j, l )
!$omp do
Do i = 1, nstate
Do l = 1,nmeas
Do j = 1, i
C(j,i) = C(j,i) - A(j,l)*B(l,i)
End Do
End Do
End Do
!$omp end do
!$omp do
Do i = 1, nstate
Do j = 1, i
C( i, j ) = C( j, i )
End Do
End Do
!$omp end do
!$omp end parallel
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write(*,'( a, t20, f8.3, t30, "Max error ", g20.12 )' ) &
'Sensible loops', elapsed_time, Maxval( Abs( C_test - C ) )
C = C_start
Call system_Clock(tclock1)
Call dgemm( 'N', 'N', nstate, nstate, nmeas, -1.0_wp, A, Size( A, Dim = 1 ), &
B, Size( B, Dim = 1 ), &
+1.0_wp, C, Size( C, Dim = 1 ) )
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write(*,'( a, t20, f8.3, t30, "Max error ", g20.12 )' ) &
'BLAS ', elapsed_time, Maxval( Abs( C_test - C ) )
End Program testspeed
ijb#ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb#ijb-Latitude-5410:~/work/stack$ gfortran -fopenmp -Wall -Wextra -std=f2018 -O3 mm.f90 -lopenblas
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=1
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 1 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.793
Orig loops 421.564 Max error 0.488853402203E-11
Sensible loops 20.742 Max error 0.488853402203E-11
BLAS 2.185 Max error 0.682121026330E-12
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=2
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 2 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.968
Orig loops 324.319 Max error 0.466116034659E-11
Sensible loops 17.656 Max error 0.466116034659E-11
BLAS 1.161 Max error 0.682121026330E-12
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=3
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 3 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.852
Orig loops 243.268 Max error 0.500222085975E-11
Sensible loops 15.802 Max error 0.500222085975E-11
BLAS 0.852 Max error 0.682121026330E-12
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=4
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 4 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.994
Orig loops 189.189 Max error 0.477484718431E-11
Sensible loops 14.245 Max error 0.477484718431E-11
BLAS 0.707 Max error 0.682121026330E-12
For BLAS I have used OpenBLAS, which is freely available. On a Linux system a simple apt-get or similar should be enough.
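For example, on Debian/Ubuntu something like the following should install it (the package name is my assumption of the usual one):
sudo apt install libopenblas-dev
You then link with -lopenblas as in the compile line shown above.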
Please also note
If you have to write your own loops, your innermost loop should, if possible, run over the first index of your array. This is because Fortran stores its arrays in column-major order. This is what I have done in the "sensible" loop ordering.
Real( 8 ) is not portable, not guaranteed to be supported by your compiler, and not guaranteed to do what you expect, and shouldn't be used. Similarly for Integer( 8 ). Please see what I have done for a better way that should work with all compilers.
Float is not a standard intrinsic - use Real as I have done.
As benchmarks are meaningless if the results are incorrect, you should always include a way to test the results. Here I use the Fortran intrinsic matmul to provide a reference version. Your original code does not initialise C, so the results cannot be trusted - but as you don't check that you get the correct values for C, you can't know this.
I personally dislike !$omp parallel do intensely; I think it a mistake that such shortcuts are in OpenMP. Instead separate them into !$omp parallel and !$omp do - it is very important to understand that thread creation and work sharing are different things, and conflating them in one line is not a good way to learn OpenMP.
Program Main
implicit none
include 'mpif.h'
!Define parameters
integer::my_rank,p2,n2,ierr,source
integer, parameter :: n=3,m=3,o=m*n
real(kind=8) aaa(n),ddd(n),bbb(n),ccc(n),xxx(n),b(m,n),start, finish
integer i, j
real h
real(kind=8),dimension(:),allocatable::sol1
h=0.25
b=0
do i=1,m
b(i,i)=1/(1.2**i)
b(i,i-1)=-b(i,i)
enddo
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,p2,ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank,ierr)
allocate(sol1(o))
start=MPI_WTIME()
do i=1,n
aaa(i)=-1/h**2
bbb(i)=2/h**2+b(my_rank+1,my_rank+1)
ccc(i)=-1/h**2
ddd(i)=1/h**2
enddo
call thomas(aaa,bbb,ccc,ddd,xxx,n)
finish=MPI_WTIME()
print*, finish-start
write(*,*) xxx, my_rank
call MPI_GATHER(xxx,n, MPI_REAL, sol1,n,MPI_REAL8,0, MPI_COMM_WORLD,ierr)
print*,sol1
call MPI_FINALIZE(ierr)
end program main
subroutine thomas(ld,md,ud,rh,solution,n)
implicit none
integer,parameter :: r8 = kind(1.d0)
integer,intent(in) :: n
real(r8),dimension(n),intent(in) :: ld,md,ud,rh
real(r8),dimension(n),intent(out) :: solution
real(r8),dimension(n) :: P,Q
real(r8) :: m
integer i
P(1) = ud(1)/md(1)
Q(1) = rh(1)/md(1)
do i = 2,n
m = md(i)-p(i-1)*ld(i)
P(i) = ud(i)/m
Q(i) = (rh(i)-Q(i-1)*ld(i))/m
end do
solution(n) = Q(n)
do i = n-1, 1, -1
solution(i) = Q(i)-P(i)*solution(i+1)
end do
end subroutine thomas
Here I used MPI_WTIME() to find the execution time. It seems that when I increase the number of processors I am not getting a speedup. In this code I have m=3 (I make m equal to the number of processors) and run with mpirun -np 3 sp.exe. Now say I change to m=10 and run with mpirun -np 10 sp.exe; I should get less time, shouldn't I? Or am I missing something here? The community helped me before with some issues and now I am getting another one. I would really appreciate it if somebody would point something out. Isn't the chunk of code starting with the do loop done by individual processors (which is what I want)?
I have a problem with my main code, so I tried to isolate it.
Therefore, I have this small code:
MODULE Param
IMPLICIT NONE
integer, parameter :: dr = SELECTED_REAL_KIND(15, 307)
integer :: D =3
integer :: Q=10
integer :: mmo=16
integer :: n=2
integer :: x=80
integer :: y=70
integer :: z=20
integer :: tMax=8
END MODULE Param
module m
contains
subroutine compute(f, r)
USE Param, ONLY: dr, mmo, x, y, z, n
IMPLICIT NONE
real (kind=dr), intent(in) :: f(x,y,z, 0:mmo, n)
real (kind=dr), intent(out) :: r(x, y, z, n)
real (kind=dr) :: fGlob(x,y,z, 0:mmo)
!-------------------------------------------------------------------------
print*, 'We are in compute subroutine'
r= 00.0
fGlob=sum(f,dim=5)
r=sum(f, dim=4)
print*, 'fGlob=', fGlob(1,1,1, 1)
print*, 'f=', f(1,1,1, 0,1)
print*, 'r=', r(1,1,1, 1)
end subroutine compute
end module m
PROGRAM test_prog
USE Param
USE m
Implicit None
integer :: tStep
real (kind=dr), dimension(:,:,:, :,:), allocatable :: f
real (kind=dr), dimension(:,:,:,:), allocatable :: r
!----------------------------------------------------------------------------
! Initialise the parameters.
print*, 'beginning of the test'
! Allocate
allocate(f(x,y,z, 0:mmo,n))
allocate(r(x,y,z, n))
f=1.0_dr
! ---------------------------------------------------------
! Iteration over time
! ---------------------------------------------------------
do tStep = 1, tMax
print *, tStep
call compute(f,r)
f=f+1
print *, 'tStep', tStep
enddo
print*, 'f=', f(1,1,1, 0,1)
print*, 'r=', r(1,1,1, 1)
! Deallacation
deallocate(f)
deallocate(r)
print*, 'End of the test program'
END PROGRAM test_prog
For now, I am not able to understand why I get a segmentation fault when I compile with ifort, while it works when I compile with gfortran. Worse, when I compile with either ifort or gfortran with their fast options, I again get a segmentation fault (core dumped) error. More confusing still, when I compile with either compiler with traceback options, everything works fine.
I know that a segmentation fault (core dumped) error usually means that I am trying to read or write in a wrong location (matrix indices, etc.), but here, with this small code, I see no mistake like that.
Can anyone help me understand why these errors occur?
The problem comes from the size of the stack used by some compilers by default (ifort) or by others when they optimise the compilation (gfortran -Ofast). Here, the temporary arrays placed on the stack exceed its size.
To solve this, I use the option -heap-arrays for the ifort compiler and -fno-stack-arrays for the gfortran compiler.
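For reference, with the small test program above, the compile lines would look something like this (the source file name test_prog.f90 is assumed):
ifort -O2 -heap-arrays test_prog.f90 -o test_prog
gfortran -Ofast -fno-stack-arrays test_prog.f90 -o test_prog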
I'm trying to break up a 4D array over the third dimension and send the pieces to each node using MPI. Basically, I'm computing derivatives of a matrix, Cpq, with respect to atom positions in each of the three Cartesian directions. Cpq is of size nat_sl x nat_sl, so dCpqdR is of size nat_sl x nat_sl x nat x 3. At the end of the day, for every (s, i) pair, I have to compute the matrix product of dCpqdR between the transpose of the eigenvectors of Cpq and the eigenvectors of Cpq, like so:
temp = MATMUL(TRANSPOSE(Cpq), MATMUL(dCpqdR(:, :, s, i), Cpq))
This is fine, but as it turns out, the loop over s and i is now by far the slowest part of my code. Because each iteration can be done independently, I was hoping that I could break up dCpqdR and give each task its own (s, i) for which to compute the derivative. That is, I'd like task 1 to get dCpqdR(:,:,1,1), task 2 to get dCpqdR(:,:,1,2), etc.
I've got this working in some sense by using a buffered send/recv pair of calls. The root node allocates a temporary array, fills it, and sends it to the relevant nodes, and those nodes do their computations as they wish. This works, but it can be slow and memory inefficient. I'd ideally like to break it up in a more memory-efficient way.
The logical thing to do, then, is to use mpi_scatterv, but this is where I start running into trouble, as I'm having difficulty figuring out the memory layout. I've written this so far:
call mpi_type_create_subarray(4, (/ nat_sl, nat_sl, nat, 3 /), (/nat_sl, nat_sl, n_pairs(me_image+1), 3/),&
(/0, 0, 0, 0/), mpi_order_fortran, mpi_double_precision, subarr_typ, ierr)
call mpi_type_commit(subarr_typ, ierr)
call mpi_scatterv(dCpqdR, n_pairs(me_image+1), f_displs, subarr_typ,&
my_dCpqdR, 3*nat_sl*3*nat_sl*3*n_pairs(me_image+1), subarr_typ,&
root_image, intra_image_comm, ierr)
I've computed n_pairs using this subroutine:
subroutine mbdvdw_para_init_int_forces()
implicit none
integer :: p, s, i, counter, k, cpu_ind
integer :: num_unique_rpq, n_pairs_per_proc, cpu
real(dp) :: Rpq(3), Rpq_norm, current_val
num_pairs = nat
if(.not.allocated(f_cpu_id)) allocate(f_cpu_id(nat, 3))
n_pairs_per_proc = floor(dble(num_pairs)/nproc_image)
cpu = 0
n_pairs = 0
counter = 1
p = 1
do counter = 0, num_pairs-1, 1
n_pairs(modulo(counter, nproc_image)+1) = n_pairs(modulo(counter, nproc_image)+1) + 1
end do
do s = 1, nat, 1
f_cpu_id(s) = cpu
if((counter.lt.num_pairs)) then
if(p.eq.n_pairs(cpu+1)) then
cpu = cpu + 1
p = 0
end if
end if
p = p + 1
end do
call mp_set_displs( n_pairs, f_displs, num_pairs, nproc_image)
f_displs = f_displs*nat_sl*nat_sl*3
end subroutine mbdvdw_para_init_int_forces
and the full routine for the matrix multiplication is:
subroutine mbdvdw_interacting_energy(energy, forcedR, forcedh, forcedV)
implicit none
real(dp), intent(out) :: energy
real(dp), dimension(nat, 3), intent(out) :: forcedR
real(dp), dimension(3,3), intent(out) :: forcedh
real(dp), dimension(nat), intent(out) :: forcedV
real(dp), dimension(3*nat_sl, 3*nat_sl) :: temp
real(dp), dimension(:,:,:,:), allocatable :: my_dCpqdR
integer :: num_negative, i_atom, s, i, j, counter
integer, parameter :: eigs_check = 200
integer :: subarr_typ, ierr
! lapack work variables
integer :: LWORK, errorflag
real(dp) :: WORK((3*nat_sl)*(3+(3*nat_sl)/2)), eigenvalues(3*nat_sl)
call start_clock('mbd_int_energy')
call mp_sum(Cpq, intra_image_comm)
eigenvalues = 0.0_DP
forcedR = 0.0_DP
energy = 0.0_DP
num_negative = 0
forcedV = 0.0_DP
errorflag=0
LWORK=3*nat_sl*(3+(3*nat_sl)/2)
call DSYEV('V', 'U', 3*nat_sl, Cpq, 3*nat_sl, eigenvalues, WORK, LWORK, errorflag)
if(errorflag.eq.0) then
do i_atom=1, 3*nat_sl, 1
!open (unit=eigs_check, file="eigs.tmp",action="write",status="unknown",position="append")
! write(eigs_check, *) eigenvalues(i_atom)
!close(eigs_check)
if(eigenvalues(i_atom).ge.0.0_DP) then
energy = energy + dsqrt(eigenvalues(i_atom))
else
num_negative = num_negative + 1
end if
end do
if(num_negative.ge.1) then
write(stdout, '(3X," WARNING: Found ", I3, " Negative Eigenvalues.")'), num_negative
end if
else
end if
energy = energy*nat/nat_sl
!!!!!!!!!!!!!!!!!!!!
! Forces below here. There's going to be some long parallelization business.
!!!!!!!!!!!!!!!!!!!!
call start_clock('mbd_int_forces')
if(.not.allocated(my_dCpqdR)) allocate(my_dCpqdR(nat_sl, nat_sl, n_pairs(me_image+1), 3)); my_dCpqdR = 0.0_DP
if(mbd_vdw_forces) then
do s=1,nat,1
if(me_image.eq.(f_cpu_id(s)+1)) then
do i=1,3,1
temp = MATMUL(TRANSPOSE(Cpq), MATMUL(my_dCpqdR(:, :, counter, i), Cpq))
do j=1,3*nat_sl,1
if(eigenvalues(j).ge.0.0_DP) then
forcedR(s, i) = forcedR(s, i) + 1.0_DP/(2.0_DP*dsqrt(eigenvalues(j)))*temp(j,j)
end if
end do
end do
counter = counter + 1
end if
end do
forcedR = forcedR*nat/nat_sl
do s=1,3,1
do i=1,3,1
temp = MATMUL(TRANSPOSE(Cpq), MATMUL(dCpqdh(:, :, s, i), Cpq))
do j=1,3*nat_sl,1
if(eigenvalues(j).ge.0.0_DP) then
forcedh(s, i) = forcedh(s, i) + 1.0_DP/(2.0_DP*dsqrt(eigenvalues(j)))*temp(j,j)
end if
end do
end do
end do
forcedh = forcedh*nat/nat_sl
call mp_sum(forcedR, intra_image_comm)
call mp_sum(forcedh, intra_image_comm)
end if
call stop_clock('mbd_int_forces')
call stop_clock('mbd_int_energy')
return
end subroutine mbdvdw_interacting_energy
But when run, it complains:
[MathBook Pro:58100] *** An error occurred in MPI_Type_create_subarray
[MathBook Pro:58100] *** reported by process [2560884737,2314885530279477248]
[MathBook Pro:58100] *** on communicator MPI_COMM_WORLD
[MathBook Pro:58100] *** MPI_ERR_ARG: invalid argument of some other kind
[MathBook Pro:58100] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[MathBook Pro:58100] *** and potentially your MPI job)
so something is going wrong, but I have no idea what. I know my description is somewhat sparse to start with, so please let me know what information would be necessary to help.