Fortran OpenMP code much slower than its non-parallel version - fortran

I want to solve the Random Walk problem, so I wrote a sequential Fortran code, and now I need to parallelize it.
subroutine random_walk(walkers)
  implicit none
  include "omp_lib.h"
  integer :: i, j, col, row, walkers, m, n, iter
  real, dimension(:, :), allocatable :: matrix, res
  real :: point, z
  col = 12
  row = 12
  allocate (matrix(row, col), res(row, col))
  ! Read from file
  open(2, file='matrix.txt')
  do i = 1, row
    read(2, *) (matrix(i, j), j = 1, col)
  end do
  res = matrix
  ! Solve task
  !$omp parallel private(i,j,m,n,point,iter)
  !$omp do collapse(2)
  do i = 2, 11
    do j = 2, 11
      m = i
      n = j
      iter = 1
      point = 0
      do while (iter <= walkers)
        call random_number(z)
        if (z <= 0.25) m = m - 1
        if (z > 0.25 .and. z <= 0.5) n = n + 1
        if (z > 0.5 .and. z <= 0.75) m = m + 1
        if (z > 0.75) n = n - 1
        if (m == 1 .or. m == 12 .or. n == 1 .or. n == 12) then
          point = point + matrix(m, n)
          m = i
          n = j
          iter = iter + 1
        end if
      end do
      point = point / walkers
      res(i, j) = point
    end do
  end do
  !$omp end do
  !$omp end parallel
  ! Write to file
  open(2, file='out_omp.txt')
  do i = 1, row
    write(2, *) (res(i, j), j = 1, col)
  end do
end subroutine random_walk
So, the problem is that the parallel program runs MUCH slower than the sequential version.
Where is the mistake? (apart from my terrible code)
Update: the code now uses the !$omp do directives shown above, but the result is still the same: it is much slower than the sequential version.

Most probably the behavior is related to the random number generation. The Fortran RANDOM_NUMBER intrinsic is not even guaranteed to be thread-safe, though it is thread-safe at least in the GNU compiler thanks to a GNU extension. But in any case the performance seems to be very bad, as you note.
If you switch to a different thread-safe random number generator, the scalability of your code can be good. I used the classical ran2.f generator:
http://www-star.st-and.ac.uk/~kw25/research/montecarlo/ran2.f
modified to make it thread-safe. If I am not wrong, to do that:
in the calling unit declare and define:
integer :: iv(32), iy, idum2, idum
idum2 = 123456789 ; iv(:) = 0 ; iy = 0
in the OpenMP directives, add idum as private and idum2, iv and iy as firstprivate (by the way, you need to add z as private too)
in the parallel section add (before do)
idum = - omp_get_thread_num()
to have different random numbers for different threads
from the ran2 function remove the DATA and SAVE lines and pass idum2, iv, iy as arguments:
FUNCTION ran2(idum, iv, iy, idum2)
call ran2 instead of the random_number intrinsic:
z = ran2(idum, iv, iy, idum2)
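Putting these steps together, the parallel region would look roughly like this (a sketch under the assumptions above, with ran2 being the modified thread-safe version):
integer :: iv(32), iy, idum2, idum
idum2 = 123456789 ; iv(:) = 0 ; iy = 0
!$omp parallel private(i,j,m,n,point,iter,z,idum) firstprivate(idum2,iv,iy)
idum = -omp_get_thread_num()   ! a different seed sequence for each thread
!$omp do collapse(2)
! ... the walker loops as before, with
!     z = ran2(idum, iv, iy, idum2)
! replacing call random_number(z) ...
!$omp end do
!$omp end parallel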
With walkers=100000 (GNU compiler) these are my times:
1 thread => 4.7s
2 threads => 2.4s
4 threads => 1.5s
8 threads => 0.78s
16 threads => 0.49s
Not strictly related to the question, but I have to say that drawing a full real number when you only need two bits of information (one of four directions), together with the chain of conditionals, could probably be replaced by a more efficient strategy.
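For example, something along these lines (an untested sketch) would replace the four conditionals with a single computed selection:
select case (int(4.0*z))   ! z in [0,1) maps to 0, 1, 2 or 3
case (0)
  m = m - 1
case (1)
  n = n + 1
case (2)
  m = m + 1
case (3)
  n = n - 1
end select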

Related

if statement to determine steady-state

My code below correctly solves a 1D heat equation for a function u(x,t). I now want to find the steady-state solution, the solution that no longer changes in time, so it should satisfy u(t+1)-u(t) = 0. What is the most efficient way to find the steady-state solution? I show three different attempts below, but I'm not sure whether any of them actually does what I want. The first and third have correct syntax; the second method has a syntax error due to the if statement. Each method differs in the structure of the if check.
Method 1 :
program parabolic1
  integer, parameter :: n = 10, m = 20
  real, parameter :: h = 0.1, k = 0.005 ! step sizes
  real, dimension (0:n) :: u, v
  integer :: i, j
  real :: pi, pi2
  u(0) = 0.0; v(0) = 0.0; u(n) = 0.0; v(n) = 0.0
  pi = 4.0*atan(1.0)
  pi2 = pi*pi
  do i = 1, n-1
    u(i) = sin( pi*real(i)*h )
  end do
  do j = 1, m
    do i = 1, n-1
      v(i) = 0.5*(u(i-1)+u(i+1))
    end do
    t = real(j)*k ! increment in time, now check for steady-state
    ! steady-state check: this checks the solution at every space point, which I don't think is correct.
    do i = 1, n-1
      if ( v(i) - u(i) .LT. 1.0e-7 ) then
        print*, 'steady-state condition reached'
        exit
      end if
    end do
    do i = 1, n-1 ! updating solution
      u(i) = v(i)
    end do
  end do
end program parabolic1
Method 2 :
program parabolic1
  integer, parameter :: n = 10, m = 20
  real, parameter :: h = 0.1, k = 0.005 ! step sizes
  real, dimension (0:n) :: u, v
  integer :: i, j
  real :: pi, pi2
  u(0) = 0.0; v(0) = 0.0; u(n) = 0.0; v(n) = 0.0
  pi = 4.0*atan(1.0)
  pi2 = pi*pi
  do i = 1, n-1
    u(i) = sin( pi*real(i)*h )
  end do
  do j = 1, m
    do i = 1, n-1
      v(i) = 0.5*(u(i-1)+u(i+1))
    end do
    t = real(j)*k ! increment in time, now check for steady-state
    ! steady-state check: (This gives an error message since the if statement doesn't have a logical scalar expression, but I want to compare the full arrays v and u as shown.)
    if ( v - u .LT. 1.0e-7 ) then
      print*, 'steady-state condition reached'
      exit
    end if
    do i = 1, n-1 ! updating solution
      u(i) = v(i)
    end do
  end do
end program parabolic1
Method 3 :
program parabolic1
  integer, parameter :: n = 10, m = 20
  real, parameter :: h = 0.1, k = 0.005 ! step sizes
  real, dimension (0:n) :: u, v
  integer :: i, j
  real :: pi, pi2
  u(0) = 0.0; v(0) = 0.0; u(n) = 0.0; v(n) = 0.0
  pi = 4.0*atan(1.0)
  pi2 = pi*pi
  do i = 1, n-1
    u(i) = sin( pi*real(i)*h )
  end do
  do j = 1, m
    do i = 1, n-1
      v(i) = 0.5*(u(i-1)+u(i+1))
    end do
    t = real(j)*k ! increment in time, now check for steady-state
    ! steady-state check: Perhaps this is the correct expression I want to use
    if ( norm2(v) - norm2(u) .LT. 1.0e-7 ) then
      print*, 'steady-state condition reached'
      exit
    end if
    do i = 1, n-1 ! updating solution
      u(i) = v(i)
    end do
  end do
end program parabolic1
Without discussing which method to determine "closeness" is best or correct (not really being a programming problem) we can focus on what the Fortran parts of the methods are doing.
Method 1 and Method 2 are similar ideas (but broken in their execution), while Method 3 is different (and broken in another way).
Note also that in general one wants to compare the magnitude of the difference abs(v-u) rather than the (signed) difference v-u. With non-monotonic changes over iterations these are quite different.
Method 3 uses norm2(v) - norm2(u) to test whether the arrays u and v are similar. This isn't correct. Consider
norm2([1.,0.]) - norm2([0.,1.])
instead of the more correct
norm2([1.,0.] - [0.,1.])
The first evaluates to zero (both arrays have unit norm) even though the arrays are completely different; the second evaluates to sqrt(2.).
Method 2's
if ( v - u .LT. 1.0e-7 ) then
has the problem of being an invalid array expression, but the "are all points close?" can be written appropriately as
if ( ALL( v - u .LT. 1.0e-7 )) then
(You'll find other questions around here about such array reductions).
Method 1 tries something similar, but incorrectly:
do i = 1,n-1
if ( v(i) - u(i) .LT. 1.0e-7 ) then
print*, 'steady-state condition reached'
exit
end if
end do
This is incorrect in one big way, and one subtle way.
First, the loop is exited when the condition tests true the first time, with a message saying the steady state is reached. This is incorrect: you need all values close, while this is testing for any value close.
Second, when the condition is met, you exit. But you don't exit the time iteration loop, you exit the closeness testing loop. (exit without a construct name leaves the innermost do construct). You'll be in exactly the same situation, running again immediately after this innermost construct whether the tested condition is ever or never met (if ever met you'll get the message also). You will need to use a construct name on the time loop.
There are other questions here about how to do that, but you also need to fix the test condition, by which point you'll be better off using if(all(...)) (corrected Method 2) without that additional do construct.
For Methods 1 and 2 you'll have something like:
if (all(abs(v - u) .lt. 1.0e-7)) then
print *, "Converged"
exit
end if
And for Method 3:
if (norm2(v-u) .lt. 1e-7) then
print *, "Converged"
exit
end if
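For completeness, since the exit has to leave the time loop rather than the inner testing construct, a named time loop would look roughly like this (a minimal sketch):
time_loop: do j = 1, m
  do i = 1, n-1
    v(i) = 0.5*(u(i-1)+u(i+1))
  end do
  if (all(abs(v - u) .lt. 1.0e-7)) then
    print *, 'steady-state condition reached'
    exit time_loop
  end if
  u(1:n-1) = v(1:n-1)
end do time_loop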

How to parallelize the nested loop

A small example serial code, which has the same structure as my code, is shown below.
PROGRAM MAIN
  IMPLICIT NONE
  INTEGER :: i, j
  DOUBLE PRECISION :: en, ei, es
  DOUBLE PRECISION :: ki(1000,2000), et(200), kn(2000)
  OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
  DO i = 1, 1000, 1
    DO j = 1, 2000, 1
      ki(i,j) = DBLE(i) + DBLE(j)
    END DO
  END DO
  DO i = 1, 200, 1
    en = 2.0d0/DBLE(200)*(i-1) - 1.0d0
    et(i) = en
    es = 0.0d0
    DO j = 1, 1000, 1
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es = es + ei
    END DO
    WRITE (UNIT=3, FMT=*) et(i), es
  END DO
  CLOSE(UNIT=3)
  STOP
END PROGRAM MAIN
SUBROUTINE CAL (en, kn, ei)
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000, 1
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
  RETURN
END SUBROUTINE CAL
I am running my code on a cluster where each node has 32 CPUs and 250 GB of memory shared by those 32 CPUs. I can use at most 32 nodes.
Every time the inner loop finishes, one data point is collected. After all outer loop iterations are done, 200 data points have been collected in total. If the inner loop is executed by only one CPU, it takes more than 3 days (more than 72 hours).
I want to parallelize both the inner loop and the outer loop. Would anyone please suggest how to parallelize this code?
Can I use the MPI technique for both the inner loop and the outer loop? If so, how do I differentiate the CPUs that execute the different loops (inner loop and outer loop)?
On the other hand, I have seen the hybrid MPI and OpenMP method mentioned. Can I use MPI for the outer loop and OpenMP for the inner loop? If so, how do I collect one data point after every inner loop finishes, and collect the 200 data points in total once all outer loop iterations are done? And how do I differentiate the CPUs that execute the inner and the outer loop respectively?
Alternatively, would anyone suggest any other way to parallelize this code and improve its efficiency? Thank you very much in advance.
As mentioned in the comments, a good answer would require a more detailed question. However, at first sight it seems that parallelizing the internal loop
DO j = 1, 1000, 1
  kn = ki(j,:)
  CALL CAL(en, kn, ei)
  es = es + ei
END DO
should be enough to solve your problem, or at least be a good starting point. First of all, I guess that there is an error in the loop
DO i = 1, 1000, 1
  DO j = 1, 2000, 1
    ki(j,k) = DBLE(j) + DBLE(k)
  END DO
END DO
since k is set to 0 and there is no cell with an address corresponding to 0 (see your variable declaration). Also, ki is declared as a (1000,2000) array, while ki(j,i) indexes it as if it were a (2000,1000) array. These errors aside, I guess that ki should be calculated as
ki(i,j) = DBLE(j) + DBLE(i)
If that is true, I suggest the following solution:
PROGRAM MAIN
  USE OMP_LIB   ! needed for omp_set_num_threads
  IMPLICIT NONE
  INTEGER :: i, j, k, icr, icr0, icr1
  DOUBLE PRECISION :: en, ei, es, timerRate
  DOUBLE PRECISION :: ki(1000,2000), et(200), kn(2000)
  INTEGER, PARAMETER :: nthreads = 1
  call system_clock(count_rate=icr)
  timerRate = real(icr)
  call system_clock(icr0)
  call omp_set_num_threads(nthreads)
  OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
  DO i = 1, 1000, 1
    DO j = 1, 2000, 1
      ki(i,j) = DBLE(j) + DBLE(i)
    END DO
  END DO
  DO i = 1, 200, 1
    en = 2.0d0/DBLE(200)*(i-1) - 1.0d0
    et(i) = en
    es = 0.0d0
    !$OMP PARALLEL DO private(j,kn,ei) firstprivate(en) shared(ki) reduction(+:es)
    DO j = 1, 1000, 1
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es = es + ei
    END DO
    !$OMP END PARALLEL DO
    WRITE (UNIT=3, FMT=*) et(i), es
  END DO
  CLOSE(UNIT=3)
  call system_clock(icr1)
  write (*,*) (icr1-icr0)/timerRate ! return computing time
  STOP
END PROGRAM MAIN
SUBROUTINE CAL (en, kn, ei)
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000, 1
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
  RETURN
END SUBROUTINE CAL
I added some variables to check the computing time ;-).
This solution runs in 5.14 s for nthreads=1 and in 2.75 s for nthreads=2. It does not divide the computing time by 2, but it seems a good deal for a first shot. Unfortunately, this machine only has a Core i3 processor, so I can't do better than nthreads=2. I do wonder, though, how the code behaves with nthreads=16.
Please let me know.
I hope this helps you.
Finally, a warning: the choice of variable status (private, firstprivate and shared) must be considered carefully in the real code.
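As a side note, with the GNU compiler such a code is typically built with something like
gfortran -fopenmp main.f90 -o main
and the number of threads can also be set at run time through the OMP_NUM_THREADS environment variable instead of calling omp_set_num_threads.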

Accumulator within a DO loop

My goal is to create 10,000 randomly generated numbers between 0 and 1, organize them into ten bins evenly spaced between 0 and 1, and compute a frequency for each bin. This is my code so far.
program listrand
  implicit none
  integer :: n, p
  integer :: a,b,c,d,e,f,g,h,i,j = 0
  real :: xran
  !real, dimension(10,2) :: bin_and_freq -- list of bins and frequency
  do n = 1, 10000
    call random_number(xran)
    if (xran < 0.1) then
      a = a + 1
    elseif (xran > 0.1 .and. xran < 0.2) then
      b = b + 1
    elseif (xran > 0.2 .and. xran < 0.3) then
      c = c + 1
    elseif (xran > 0.3 .and. xran < 0.4) then
      d = d + 1
    elseif (xran > 0.4 .and. xran < 0.5) then
      e = e + 1
    elseif (xran > 0.5 .and. xran < 0.6) then
      f = f + 1
    elseif (xran > 0.6 .and. xran < 0.7) then
      g = g + 1
    elseif (xran > 0.7 .and. xran < 0.8) then
      h = h + 1
    elseif (xran > 0.8 .and. xran < 0.9) then
      i = i + 1
    else
      j = j + 1
    endif
  enddo
  print *, a, b, c, d, e, f, g, h, i, j
end program listrand
I am getting an unexpected output:
988 1036 133225987 1004 934 986 1040 33770 1406729616 1052.
Why are c,h, and i so large? Also, is there a more efficient way of going about this than using the unwieldy IF/ELSEIF block I have?
In your long declaration
integer :: a,b,c,d,e,f,g,h,i,j = 0
you are only initialising j to 0; all the others start with whatever garbage values happen to be in memory. If you add
a = 0
b = 0
c = 0
d = 0
e = 0
f = 0
g = 0
h = 0
i = 0
j = 0
before your loop, everything works well.
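Note that to initialise all of them in the declaration itself, each variable needs its own initialiser:
integer :: a = 0, b = 0, c = 0, d = 0, e = 0, f = 0, g = 0, h = 0, i = 0, j = 0
(Bear in mind that initialising in a declaration implies the SAVE attribute, which is harmless in a main program like this but can surprise you inside procedures.)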
As for how to simplify it:
Here is my version of the program:
program listrand
  implicit none
  integer, parameter :: nbins = 10
  integer :: n, bin
  integer :: bin_hits(nbins)  ! Number of bin hits
  real :: xran
  real :: bin_lower(nbins)    ! Lower edge of bins
  ! bin_lower(1) == 0.0
  bin_hits = 0
  ! Set up equidistant bins
  bin_lower = [ (real(n-1) / nbins, n = 1, size(bin_lower)) ]
  do n = 1, 10000
    call random_number(xran)
    bin = count(bin_lower <= xran)
    bin_hits(bin) = bin_hits(bin) + 1
  enddo
  do n = 1, nbins-1
    print '(2(F6.2), I6)', bin_lower(n), bin_lower(n+1), bin_hits(n)
  end do
  print '(2(F6.2), I6)', bin_lower(nbins), 1.0, bin_hits(nbins)
end program listrand
For the index of which bin_hits element to increment, I count the number of values in bin_lower that are less than or equal to xran.
EDIT
I'd also like to point to the answer from High Performance Mark a bit further down, who, instead of calling RANDOM_NUMBER for each value individually, uses it to generate a whole array of random numbers.
Additionally, he's using the fact that the bins are fixed and equidistant to calculate the bin number directly from the random value instead of comparing it to each bin as in my version.
Both of these make the program faster.
If speed of execution is one's main concern, and if one is willing to trade space for time, this might appeal:
PROGRAM listrand
  IMPLICIT NONE
  INTEGER, PARAMETER :: nbins = 10
  INTEGER, PARAMETER :: nsamples = 10**4
  INTEGER :: bin_hits(0:nbins-1)
  REAL :: xran(nsamples)
  INTEGER :: binned_rn(nsamples), n
  bin_hits = 0
  CALL RANDOM_NUMBER(xran)
  binned_rn = INT(nbins*xran)
  DO n = 1, nsamples
    bin_hits(binned_rn(n)) = bin_hits(binned_rn(n)) + 1
  END DO
  WRITE(*,*) bin_hits
END PROGRAM listrand
In a limited number of tests this version is 3 - 4 times as fast as #chw21's version.

parallel do mistake in fortran

program main
  use omp_lib
  implicit none
  integer :: n = 8
  integer :: i, j, myid, a(8, 8), b, c(8)
  ! Generate a 8*8 array A
  !$omp parallel default(none), private(i, myid), &
  !$omp shared(a, n)
  myid = omp_get_thread_num() + 1
  do i = 1, n
    a(i, myid) = i*myid
  end do
  !$omp end parallel
  ! Array A
  print*, 'Array A is'
  do i = 1, n
    print*, a(:, i)
  end do
  ! Sum of array A
  b = 0
  !$omp parallel reduction(+:b), shared(a, n), private(i, myid)
  myid = omp_get_thread_num() + 1
  do i = 1, n
    b = b + a(i, myid)
  end do
  !$omp end parallel
  print*, 'Sum of array A by reduction is ', b
  b = 0
  c = 0
  !$omp parallel do
  do i = 1, n
    do j = 1, n
      c(i) = c(i) + a(j, i)
    end do
  end do
  !$omp end parallel do
  print*, 'Sum of array A by using parallel do is', sum(c)
  !$omp parallel do
  do i = 1, n
    do j = 1, n
      b = b + a(j, i)
    end do
  end do
  !$omp end parallel do
  print*, 'Sum of array A by using parallel do in another way is', b
end program main
I wrote the piece of Fortran code above to use OpenMP to sum all the elements of an 8*8 array in three different ways. The first uses reduction and works. In the second, I created a one-dimensional array with 8 elements, summed each column in the parallel region, and then summed those partial sums; this works as well. In the third, I used a single integer to accumulate every element of the array inside a parallel do region. This result is not correct and varies every run. I don't understand why this happens. Is it because I didn't specify shared and private, or because the variable b is overwritten during the procedure?
There is a race condition on b in your third scenario: several threads are reading and writing the same variable without proper synchronization / privatization.
Note that you don't have a race condition in the second scenario: each thread is updating data (i.e. c(i)) that no one else is accessing.
Finally, some solutions for your last scenario:
Add the reduction(+:b) clause to the directive
Add an !$omp atomic directive before the b = b + a(j, i) statement
Implement a manual privatization
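A minimal sketch of the first option, applied to the loop from the question:
b = 0
!$omp parallel do private(j) reduction(+:b)
do i = 1, n
  do j = 1, n
    b = b + a(j, i)
  end do
end do
!$omp end parallel do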

Nesting errors in FORTRAN

I'm creating a program that is required to read values from two arrays (ALT and MRK), counting each pair of values (I,J) in order to determine their frequency for a third array (X). I've written the following so far, but nesting errors are preventing the program from compiling. Any help is greatly appreciated!
IMPLICIT NONE
REAL, DIMENSION (0:51, 0:51) :: MRK, ALT
INTEGER :: I, J !! FREQUENCY ARRAY ALLELES
INTEGER, PARAMETER :: K = 2
INTEGER :: M, N !! HAPLOTYPE ARRAY POSITIONS
INTEGER :: COUNTER = 0
REAL, DIMENSION(0:1, 0:K-1) :: X
ALT = 8
MRK = 8
X = 0
MRK(1:50,1:50) = 0 !! HAPLOTYPE ARRAY WITHOUT BUFFER AROUND OUTSIDE
ALT(1:50,1:50) = 0
DO I = 0, 1 !! ALTRUIST ALLELE
  DO J = 0, K-1 !! MARKER ALLELE
    DO M = 1, 50
      DO N = 1, 50 !! READING HAPLOTYPE POSITIONS
        IF ALT(M,N) = I .AND. MRK(M,N) = J THEN
          COUNTER = COUNTER + 1
        ELSE IF ALT(M,N) .NE. I .OR. MRK(M,N) .NE. J THEN
          COUNTER = COUNTER + 0
        END IF
        X(I,J) = COUNTER/2500
        COUNTER = 0
      END DO
    END DO
  END DO
END DO
Your if syntax is incorrect: you should enclose the conditional expressions in parentheses. Also, you should replace the single = with a double == in those expressions, and it is best to stick to one operator style, either == and /= or .eq. and .ne., rather than mixing them:
IF (ALT(M,N) == I .AND. MRK(M,N) == J) THEN
  COUNTER = COUNTER + 1
ELSE IF (ALT(M,N) /= I .OR. MRK(M,N) /= J) THEN
  COUNTER = COUNTER + 0
END IF
I don't know whether you do it in your actual program, but you should also wrap your code in program program_name at the very beginning and end program program_name at the very end, where program_name is anything you want to call your program (no spaces allowed), although a simple end at the end would suffice.
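For example (a minimal sketch, using a hypothetical name):
PROGRAM ALLELE_FREQ
  IMPLICIT NONE
  ! ... declarations and loops as above ...
END PROGRAM ALLELE_FREQ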