I'm using Fortran 90 to solve a simple integration problem and measuring the speed difference when run in parallel. I'm having trouble getting the correct result when parallelizing the computation with OpenMP.
program midpoint
use omp_lib
implicit none
integer :: beginning, rate, end, iteration
double precision :: sum, div, x, sum2
integer :: a, b, n

n = 100000000
a = 10
b = 0
div = dble(a-b)/n
x = b + div/2
sum = 0.0

call system_clock(beginning, rate)
do iteration = 1, n
   sum = sum + sqrt(x)*div ! evaluating sqrt(x) function
   x = x + div
end do
call system_clock(end)
print *, "Computation from single core: ", sum
print *, "elapsed time from single core: ", real(end - beginning) / real(rate)

x = b + div/2
sum = 0.0
sum2 = 0.0

call system_clock(beginning, rate)
!$omp parallel private(iteration, sum) shared(sum2, x)
!$omp do
do iteration = 1, n
   sum = sum + sqrt(x)*div ! evaluating sqrt(x) function
   x = x + div
end do
!$omp end do
sum2 = sum2 + sum
!$omp end parallel
call system_clock(end)
print *, "Computation from multiple cores: ", sum2
print *, "elapsed time from multiple cores: ", real(end - beginning) / real(rate)
end program
Thanks
You've programmed a race condition. In the line
sum2 = sum2 + sum
you've given the threads authority to read and write a shared variable (sum2) with no control over the sequencing of those operations. The same problem arises with x = x + div inside the loop, since x is shared there too.
Continue reading your OpenMP tutorial until you encounter the reduction clause, which is designed for exactly what you seem to be doing. Learn too about the firstprivate clause, which initialises a thread-local variable with the value of the variable of the same name on entry to the parallel region.
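A tiny self-contained illustration of firstprivate (the variable name k and the value 42 are arbitrary, chosen just for this demo):
program firstprivate_demo
use omp_lib
implicit none
integer :: k
k = 42
! each thread gets its own copy of k, initialised to 42 on entry
!$omp parallel firstprivate(k)
k = k + omp_get_thread_num()
print *, 'thread', omp_get_thread_num(), 'sees k =', k
!$omp end parallel
print *, 'after the region the original k is still', k ! prints 42
end program firstprivate_demo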
One wrinkle to note: x = x + div carries a value from one iteration to the next, so even firstprivate(x) would leave every thread integrating the same leading subinterval. The robust fix is to compute x directly from the loop index. I haven't checked the syntax carefully, but it should be something like this:
!$omp parallel do private(iteration, x) shared(div) reduction(+:sum)
do iteration = 1, n
   x = b + (iteration - 0.5d0)*div ! midpoint of this iteration's subinterval
   sum = sum + sqrt(x)*div ! evaluating sqrt(x) function
end do
!$omp end parallel do
! at this point the value of sum will have been 'reduced' across all threads
print *, "Computation from multiple cores: ", sum
I would like to ask whether OpenMP is capable of parallelizing operations on Fortran arrays of the same shape and size written in simple (whole-array) notation. I did some research but could not find out whether it is possible.
By simple notation I mean the following form:
a = b + c * 1.1
Find below a full example:
PROGRAM Parallel_Hello_World
USE OMP_LIB
implicit none
integer, parameter :: ILEN = 1000
integer :: a(ILEN,ILEN), b(ILEN,ILEN), c(ILEN,ILEN), d(ILEN,ILEN)
integer :: i, j

a = 1
b = 2

!$OMP PARALLEL SHARED(a, b, c, d)
!$OMP DO
DO i = 1, ILEN
   DO j = 1, ILEN
      c(j,i) = a(j,i) + b(j,i) * 1.1
   ENDDO
END DO
!$OMP END DO

! is this loop parallel?
d = a + b * 1.1
!$OMP END PARALLEL

write (*,*) "Total C: ", c(1:5, 1)
write (*,*) "Total D: ", d(1:5, 1)
write (*,*) "C same D? ", all(c == d)
END
Is the assignment to d parallelized by OpenMP in the current notation?
As commented by @Gilles, the answer is to wrap the assignment in the WORKSHARE construct:
!$OMP WORKSHARE
d = a + b * 1.1
!$OMP END WORKSHARE
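For completeness, a minimal compilable sketch of the construct (the names and sizes simply mirror the example above; how effectively an implementation distributes the work across threads can vary):
program ws_demo
implicit none
integer, parameter :: ILEN = 1000
integer :: a(ILEN,ILEN), b(ILEN,ILEN), d(ILEN,ILEN)
a = 1
b = 2
!$OMP PARALLEL SHARED(a, b, d)
!$OMP WORKSHARE
d = a + b * 1.1 ! the array assignment is divided into units of work for the team
!$OMP END WORKSHARE
!$OMP END PARALLEL
print *, d(1, 1) ! prints 3 (1 + 2*1.1 = 3.2, truncated on integer assignment)
end program ws_demo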
I have some finite element code written in Fortran 95 that I have optimised so that I can now get well over 16 million elements working within a 2 GB memory footprint.
The source function for my code is not smooth, so I am using a (stratified) Monte Carlo method to integrate, which requires a random number generator to select sample locations.
I have tried compiling with gfortran-9 using -fopenmp -Ofast -ftree-parallelize-loops=4, but the loop with the random number generator won't go parallel. I tried do concurrent, but that didn't work because random_number isn't pure: https://stackoverflow.com/a/32637737/2372254
I also tried blocking my loop, but that didn't work either.
Here is the code I am talking about
do k = 1, n_els ! total elements is n_els**2. This is the block loop
   do i = 1 + (k-1)*n_els, k*n_els
      supp_vec = 0
      integ_vec = 0.0_wp
      ! in this subroutine I call random_number
      call do_element(ind, n_els, i, num_points_per_strat, &
                      strat_rows, strat_cols, supp_vec, integ_vec)
      do j = 1, 4
         sc_vec(supp_vec(j)) = integ_vec(j)
      end do
      ! give some info about progress
      if (mod(i, (n_els**2)/10) == 0) print *, i*10/((n_els**2)/10), "% done"
   end do
end do
It seems I could write blocks to a file and call n different instances of the routine, but I figure there must be a cleaner way. Any tips on how to get this going faster?
I was considering writing a block's worth of points (depending on memory limits) to an array first and supplying that in the subroutine call. Before I try that, I thought I would see if anybody had advice on a better approach. It would be good to keep the memory footprint down where possible.
As of version 7, GFortran has a parallel (thread-safe) random number generator. When implementing it, here's the OpenMP code I used to verify that the performance indeed scales with increasing numbers of threads (from https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02110.html):
! Benchmark generating random numbers
! Janne Blomqvist 2015
program randbench
#ifdef _OPENMP
  use omp_lib
#endif
  implicit none
  integer, parameter :: dp = kind(0.d0)              ! double precision
  integer, parameter :: i64 = selected_int_kind(18)  ! at least 64-bit integer
#ifdef _OPENMP
  print *, "Using up to ", omp_get_max_threads(), " threads."
#endif
  call genr4
  call genr8
contains
  subroutine genr4
    integer, parameter :: n = int(1e7)
    real, save :: r(n)
    integer :: i
    integer(i64) :: t1, t2, td
#ifdef _OPENMP
    integer :: blocks, blocksize, l, h
#endif
    print *, "Generate default real random variables"
    call system_clock (t1)
    !$omp parallel do private(i)
    do i = 1, n
      call random_number(r(i))
    end do
    !$omp end parallel do
    call system_clock (t2)
    td = t2 - t1
    print *, "Generating ", n, " default reals individually took ", td, " ticks."
    call system_clock (t1)
#ifdef _OPENMP
    blocks = omp_get_max_threads()
    blocksize = n / blocks
    !$omp parallel do private(l,h,i)
    do i = 0, blocks - 1
      l = i * blocksize + 1
      h = l + blocksize - 1
      !print *, "Low: ", l, " High: ", h
      call random_number(r(l:h))
    end do
#else
    call random_number(r)
#endif
    call system_clock (t2)
    print *, "Generating ", n, " default reals as an array took ", t2-t1, &
        " ticks. => ind/arr = ", real(td, dp) / (t2-t1)
  end subroutine genr4

  subroutine genr8
    integer, parameter :: n = int(1e7)
    real(dp), save :: r(n)
    integer :: i
    integer(i64) :: t1, t2, td
#ifdef _OPENMP
    integer :: blocks, blocksize, l, h
#endif
    print *, "Generate double real random variables"
    call system_clock (t1)
    !$omp parallel do
    do i = 1, n
      call random_number(r(i))
    end do
    call system_clock (t2)
    td = t2 - t1
    print *, "Generating ", n, " double reals individually took ", td, " ticks."
    call system_clock (t1)
#ifdef _OPENMP
    blocks = omp_get_max_threads()
    blocksize = n / blocks
    !$omp parallel do private(l,h,i)
    do i = 0, blocks - 1
      l = i * blocksize + 1
      h = l + blocksize - 1
      !print *, "Low: ", l, " High: ", h
      call random_number(r(l:h))
    end do
#else
    call random_number(r)
#endif
    call system_clock (t2)
    print *, "Generating ", n, " double reals as an array took ", t2-t1, &
        " ticks. => ind/arr = ", real(td, dp) / (t2 - t1)
  end subroutine genr8
end program
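A build note: since the benchmark uses #ifdef, the source has to pass through the preprocessor; with gfortran that means something like gfortran -fopenmp -cpp randbench.f90 (or giving the file an uppercase .F90 suffix, which implies -cpp).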
I want to solve the Random Walk problem, so I wrote a sequential Fortran code and now I need to parallelize it.
subroutine random_walk(walkers)
implicit none
include "omp_lib.h"
integer :: i, j, col, row, walkers, m, n, iter
real, dimension(:, :), allocatable :: matrix, res
real :: point, z

col = 12
row = 12
allocate (matrix(row, col), res(row, col))

! Read from file
open(2, file='matrix.txt')
do i = 1, row
   read(2, *) (matrix(i, j), j = 1, col)
end do
res = matrix

! Solve task
!$omp parallel private(i, j, m, n, point, iter)
!$omp do collapse(2)
do i = 2, 11
   do j = 2, 11
      m = i
      n = j
      iter = 1
      point = 0
      do while (iter <= walkers)
         call random_number(z)
         if (z <= 0.25) m = m - 1
         if (z > 0.25 .and. z <= 0.5) n = n + 1
         if (z > 0.5 .and. z <= 0.75) m = m + 1
         if (z > 0.75) n = n - 1
         if (m == 1 .or. m == 12 .or. n == 1 .or. n == 12) then
            point = point + matrix(m, n)
            m = i
            n = j
            iter = iter + 1
         end if
      end do
      point = point / walkers
      res(i, j) = point
   end do
end do
!$omp end do
!$omp end parallel

! Write to file
open(2, file='out_omp.txt')
do i = 1, row
   write(2, *) (res(i, j), j = 1, col)
end do
end subroutine random_walk
So, the problem is that the parallel program is MUCH slower than its sequential version.
Where is the mistake? (Apart from my terrible code.)
Update: the code now uses the !$omp do directives, but the result is still the same: it is much slower than the sequential version.
Most probably the behavior is related to the random number extraction. The Fortran RANDOM_NUMBER procedure is not even guaranteed to be thread-safe, although it is thread-safe at least in the GNU compiler thanks to a GNU extension. In any case the performance seems to be very bad, as you note.
If you switch to a different thread-safe random number generator, the scalability of your code can be good. I used the classical ran2.f generator:
http://www-star.st-and.ac.uk/~kw25/research/montecarlo/ran2.f
modified to make it thread-safe. If I am not mistaken, the steps are as follows (a compilable sketch is given after this list):
- In the calling unit, declare and define:
integer :: iv(32), iy, idum2, idum
idum2 = 123456789 ; iv(:) = 0 ; iy = 0
- In the OpenMP directives, add idum as private and idum2, iv, iy as firstprivate (by the way, you need to add z as private too).
- In the parallel section, before the do, add
idum = - omp_get_thread_num()
to give different threads different random number streams.
- In the ran2 function, remove the DATA and SAVE lines and pass idum2, iv, iy as arguments:
FUNCTION ran2(idum, iv, iy, idum2)
- Call ran2 instead of the random_number intrinsic:
z = ran2(idum, iv, iy, idum2)
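Putting the pieces together, here is a minimal, self-contained sketch of the per-thread-state pattern (the simple Lehmer/Park-Miller step below is only a stand-in for the modified ran2, used so the example compiles on its own):
program per_thread_rng
use omp_lib
implicit none
integer, parameter :: i8 = selected_int_kind(18)
integer, parameter :: ndraws = 1000000
integer :: i
integer(i8) :: state
real :: z, total
total = 0.0
!$omp parallel private(i, state, z) reduction(+:total)
! every thread owns its generator state, seeded from its thread id
state = 123456789_i8 + 9973_i8 * omp_get_thread_num()
!$omp do
do i = 1, ndraws
   state = mod(state * 48271_i8, 2147483647_i8) ! touches only private data
   z = real(state) / 2147483647.0
   total = total + z
end do
!$omp end do
!$omp end parallel
print *, 'mean of draws =', total / ndraws ! should be close to 0.5
end program per_thread_rng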
With walkers=100000 (GNU compiler) these are my times:
1 thread => 4.7s
2 threads => 2.4s
4 threads => 1.5s
8 threads => 0.78s
16 threads => 0.49s
Not strictly related to the question, but drawing a full real number for each step when you only need two bits of information (a choice among four directions), together with the cascade of conditionals, could probably be replaced by a more efficient strategy.
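For illustration (an assumption about what such a strategy might look like, not something benchmarked here), a single draw can be mapped straight onto one of the four directions without the comparison chain:
program one_step
implicit none
integer :: m, n
real :: z
m = 6 ; n = 6
call random_number(z)
! map one draw onto a direction index 0..3, then take the step
select case (int(4.0*z))
case (0) ; m = m - 1
case (1) ; n = n + 1
case (2) ; m = m + 1
case (3) ; n = n - 1
end select
print *, 'moved to', m, n
end program one_step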
program main
use omp_lib
implicit none
integer :: n = 8
integer :: i, j, myid, a(8, 8), b, c(8)

! Generate a 8*8 array A
!$omp parallel default(none), private(i, myid), &
!$omp shared(a, n)
myid = omp_get_thread_num() + 1
do i = 1, n
   a(i, myid) = i * myid
end do
!$omp end parallel

! Array A
print*, 'Array A is'
do i = 1, n
   print*, a(:, i)
end do

! Sum of array A
b = 0
!$omp parallel reduction(+:b), shared(a, n), private(i, myid)
myid = omp_get_thread_num() + 1
do i = 1, n
   b = b + a(i, myid)
end do
!$omp end parallel
print*, 'Sum of array A by reduction is ', b

b = 0
c = 0
!$omp parallel do
do i = 1, n
   do j = 1, n
      c(i) = c(i) + a(j, i)
   end do
end do
!$omp end parallel do
print*, 'Sum of array A by using parallel do is', sum(c)

!$omp parallel do
do i = 1, n
   do j = 1, n
      b = b + a(j, i)
   end do
end do
!$omp end parallel do
print*, 'Sum of array A by using parallel do in another way is', b
end program main
I wrote the piece of Fortran code above to sum all the elements of an 8*8 array in three different ways using OpenMP. The first way uses reduction and works. In the second, I created a one-dimensional array with 8 elements, summed up each column in the parallel region, and then summed those partial results; this works as well. In the third, I used a single integer to sum up every element of the array inside a parallel do region. This result is not correct and varies from run to run. I don't understand why this happens. Is it because I didn't specify shared and private, or is the variable b overwritten during the process?
There is a race condition on b in your third scenario: several threads are reading and writing the same variable without proper synchronization or privatization.
Note that you don't have a race condition in the second scenario: each thread is updating some data (i.e. c(i)) that no one else is accessing.
Finally, some solutions for your last scenario (a sketch of the first one follows below):
Add the reduction(+:b) clause to the parallel do directive
Add an !$omp atomic directive right before the b = b + a(j,i) statement
Implement the privatization manually
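A minimal, self-contained sketch of the reduction fix (with dummy data in a, since the original fills it inside another parallel region):
program sum_fixed
implicit none
integer, parameter :: n = 8
integer :: i, j, b
integer :: a(n, n)
a = 1 ! dummy data: 64 elements, each equal to 1
b = 0
!$omp parallel do private(j) reduction(+:b)
do i = 1, n
   do j = 1, n
      b = b + a(j, i) ! each thread accumulates into its own private copy of b
   end do
end do
!$omp end parallel do
print *, 'sum =', b ! the private copies are combined here; prints 64
end program sum_fixed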
I already have the code below, but it is still not working. If the code is correct, please help me understand how to compile it. I tried to compile it as follows:
gfortran trap.f -fopenmp
PROGRAM TRAP
USE OMP_LIB
DOUBLE PRECISION INTEG, TMPINT
DOUBLE PRECISION A, B
PARAMETER (A=3.0, B=7.0)
INTEGER N
PARAMETER (N=10)
DOUBLE PRECISION H
DOUBLE PRECISION X
INTEGER I
DOUBLE PRECISION F
H = (B-A)/N
INTEG = 0.0
TMPINT = 0.0
!$omp parallel firstprivate(X, TMPINT) shared(INTEG)
!$omp do
DO 10 I=1,N-1,1
X=A+I*H
TMPINT = TMPINT + F(X)
10 CONTINUE
!$omp end do
!$omp critical
INTEG = INTEG + TMPINT
!$omp end critical
!$omp end parallel
NTEG = (INTEG+(F(A)+F(B))/2.0)*H
PRINT *, "WITH N=", N, "INTEGRAL=", INTEG
END
FUNCTION F(X)
DOUBLE PRECISION X
F = X / (X + 1) * EXP(-X + 2)
END
The compiler gives the following problems:
http://i.stack.imgur.com/QPknv.png
http://i.stack.imgur.com/GYkmN.png
Your program has the suffix .f, so gfortran assumes that the code is in fixed format and complains that many statements are "unclassifiable". To fix this, change the file name to trap.f90 and compile as gfortran -fopenmp trap.f90, so that free format is assumed. There are also other problems: one is that the return type of the function F(X) does not match the type declared in the main program, so F(X) needs to be modified as
FUNCTION F(X)
implicit none !<--- this is always recommended
DOUBLE PRECISION X, F !<--- add F here
F = X / (X + 1) * EXP(-X + 2)
END
Another issue is that NTEG is probably a typo of INTEG, so it should be modified as
INTEG = (INTEG+(F(A)+F(B))/2.0)*H
(this is automatically detected if we have implicit none in the main program). Now running the code with, e.g. 8 threads, gives
$ OMP_NUM_THREADS=8 ./a.out
WITH N= 10 INTEGRAL= 0.28927708626319770
while the exact result is 0.28598... Increasing the value of N, we can confirm that the agreement becomes better:
WITH N= 100 INTEGRAL= 0.28602065571967972
WITH N= 1000 INTEGRAL= 0.28598803555916535
WITH N= 10000 INTEGRAL= 0.28598770935198736
WITH N= 100000 INTEGRAL= 0.28598770608991503
BTW, it is probably easier to use the reduction clause to do the same thing, for example:
INTEG = 0.0
!$omp parallel do reduction(+ : integ) private(x)
DO I = 1, N-1
X = A + I * H
INTEG = INTEG + F( X )
ENDDO
!$omp end parallel do
INTEG = (INTEG+(F(A)+F(B))/2.0)*H
Your code is in fixed form (.f). Therefore, you must follow the rules of the fixed format: the first six characters of each line have a special meaning and should be blank unless you place a comment marker in the first position, a continuation character in the sixth position, or a statement label (like the 10 here) in columns 1-5.
If you format your code accordingly, the compiler complains about a mismatch in the return type of F(X). As you do not use implicit none, the type is determined by the first letter of the function name, and F maps to a (single-precision) real. So you need to declare the return type explicitly.
Then the code looks like:
      PROGRAM TRAP
      USE OMP_LIB
      DOUBLE PRECISION INTEG, TMPINT
      DOUBLE PRECISION A, B
      PARAMETER (A=3.0, B=7.0)
      INTEGER N
      PARAMETER (N=10)
      DOUBLE PRECISION H
      DOUBLE PRECISION X
      INTEGER I
      DOUBLE PRECISION F

      H = (B-A)/N
      INTEG = 0.0
      TMPINT = 0.0
c$omp parallel firstprivate(X, TMPINT) shared(INTEG)
c$omp do
      DO 10 I=1,N-1,1
         X=A+I*H
         TMPINT = TMPINT + F(X)
   10 CONTINUE
c$omp end do
c$omp critical
      INTEG = INTEG + TMPINT
c$omp end critical
c$omp end parallel
      INTEG = (INTEG+(F(A)+F(B))/2.0)*H
      PRINT *, "WITH N=", N, "INTEGRAL=", INTEG
      END

      DOUBLE PRECISION FUNCTION F(X)
      DOUBLE PRECISION X
      F = X / (X + 1) * EXP(-X + 2)
      END
[Please note that I also fixed the NTEG = line into INTEG =, as I believe this was intended. I did not check the code for validity.]