openmp gives different result from sequential - fortran

I am a beginner with OpenMP. I am trying to parallelize a simple loop, but it is clearly wrong.
Can anyone kindly help?
use omp_lib
implicit none
integer ::n=20,it,j
real(8) :: a=0,b=5,s,func,del,sum1,tnm,x
it=2**(n-2)
tnm=it
del=(b-a)/tnm
x=a+.5*del
sum1=0.0
!$omp parallel do reduction(+:sum1)
do j=1,it
sum1=sum1+func(x)
! write(*,*)func(x)
x=x+del
end do
!$omp end parallel do
s=.5*(s+(b-a)*sum1/tnm)
write(*,*)s,sum1,it,del
end
function func(x)
real(8)::x
func=x
end function
EDIT
Results:
Parallel run:
$ ifort -openmp trapzd.f90
$ ./a.out
0.976562500000000 102400.000000000 262144 1.907348632812500E-005
Sequential run:
$ ifort trapzd.f90
$ ./a.out
6.25000000000000 655360.000000000 262144 1.907348632812500E-005
s and sum1 differ between these two runs

I see three issues here:
A missing type declaration for the function func
s is used uninitialized
Race condition for variable x
This works for me:
program test
use omp_lib
implicit none
integer :: n=20,it,j
real(8) :: a=0,b=5,s,func,del,sum1,tnm,x,x0
it=2**(n-2)
tnm=it
del=(b-a)/tnm
x=a+.5*del
sum1=0.0
x0 = x
!$omp parallel do reduction(+:sum1)
do j=1,it
sum1=sum1+func(x0+(j-1)*del)
! write(*,*)func(x0+(j-1)*del)
end do
!$omp end parallel do
! s is uninitialized here - set it to 0
! ...probably not what you had in mind
s = 0.
s=.5*(s+(b-a)*sum1/tnm)
write(*,*)s,sum1,it,del
end program
real(8) function func(x)
real(8)::x
func=x
end function
Compiled with gfortran:
OMP_NUM_THREADS=1 ./a.out
6.2500000000000000 655360.00000000000 262144 1.9073486328125000E-005
OMP_NUM_THREADS=6 ./a.out
6.2500000000000000 655360.00000000000 262144 1.9073486328125000E-005
Compiled with ifort:
OMP_NUM_THREADS=1 ./a.out
6.25000000000000 655360.000000000 262144 1.907348632812500E-005
OMP_NUM_THREADS=6 ./a.out
6.25000000000000 655360.000000000 262144 1.907348632812500E-005
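For what it's worth, a hedged variant of the above fix that also sidesteps the implicit-typing pitfall is to make func an internal procedure, so the compiler checks its type and interface for you (a sketch only, not benchmarked):
program trapzd
use omp_lib
implicit none
integer :: n=20,it,j
real(8) :: a=0,b=5,s,del,sum1,tnm,x0
it=2**(n-2)
tnm=it
del=(b-a)/tnm
x0=a+.5d0*del
sum1=0.0d0
!$omp parallel do reduction(+:sum1)
do j=1,it
! abscissa computed from the loop index: no loop-carried dependence on x
sum1=sum1+func(x0+(j-1)*del)
end do
!$omp end parallel do
s=.5d0*(b-a)*sum1/tnm
write(*,*)s,sum1,it,del
contains
real(8) function func(x)
real(8), intent(in) :: x
func=x
end function func
end program trapzd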

Related

Possible bug in ifort

While trying to use a type for blockdiagonal matrices in Fortran code,
I stumbled over a surprising bug in the following piece of code, using these compilers:
GNU Fortran (SUSE Linux) 7.4.0
ifort (IFORT) 18.0.5 20180823
ifort (IFORT) 16.0.1 20151021
If I compile
gfortran -Wall -Werror --debug ifort_bug.f && valgrind ./a.out
I have no errors reported from valgrind.
If I compile
ifort -warn all,error -debug -stacktrace ifort_bug.f && valgrind ./a.out
I get a segmentation fault for ifort_18 in my code and "only" memory leaks for ifort_16.
Is this a bug in the intel compilers, or does gfortran silently fix my bad code?
module blockdiagonal_matrices
implicit none
private
public :: t_blockdiagonal, new, delete,
& blocksizes, operator(.mult.), mult
save
integer, parameter :: dp = kind(1.d0)
type :: t_blockdiagonal
real(dp), allocatable :: block(:, :)
end type
interface new
module procedure block_new
end interface
interface delete
module procedure block_delete
end interface
interface operator (.mult.)
module procedure mult_blocks
end interface
contains
subroutine block_new(blocks, blocksizes)
type(t_blockdiagonal), intent(out) :: blocks(:)
integer, intent(in) :: blocksizes(:)
integer :: i, L
do i = 1, size(blocks)
L = blocksizes(i)
allocate(blocks(i)%block(L, L))
end do
end subroutine
subroutine block_delete(blocks)
type(t_blockdiagonal) :: blocks(:)
integer :: i
do i = 1, size(blocks)
deallocate(blocks(i)%block)
end do
end subroutine
function blocksizes(A) result(res)
type(t_blockdiagonal), intent(in) :: A(:)
integer :: res(size(A))
integer :: i
res = [(size(A(i)%block, 1), i = 1, size(A))]
end function
function mult_blocks(A, B) result(C)
type(t_blockdiagonal), intent(in) :: A(:), B(:)
type(t_blockdiagonal) :: C(size(A))
integer :: i
call new(C, blocksizes=blocksizes(A))
do i = 1, size(A)
C(i)%block = matmul(A(i)%block, B(i)%block)
end do
end function
subroutine mult(A, B, C)
type(t_blockdiagonal), intent(in) :: A(:), B(:)
type(t_blockdiagonal) :: C(:)
integer :: i
do i = 1, size(A)
C(i)%block = matmul(A(i)%block, B(i)%block)
end do
end subroutine
end module blockdiagonal_matrices
program time_blockdiagonal
use blockdiagonal_matrices
integer, parameter :: n_blocks = 2, L_block = 10**2
type(t_blockdiagonal) :: A(n_blocks), B(n_blocks), C(n_blocks)
integer :: i
integer :: start, finish, rate
call system_clock(count_rate=rate)
call new(A, blocksizes=[(L_block, i = 1, n_blocks)])
call new(B, blocksizes=[(L_block, i = 1, n_blocks)])
call new(C, blocksizes=[(L_block, i = 1, n_blocks)])
do i = 1, n_blocks
call random_number(A(i)%block)
call random_number(B(i)%block)
end do
call system_clock(start)
C = A .mult. B
call system_clock(finish)
write(6,*) 'Elapsed Time in seconds:',
& dble(finish - start) / dble(rate)
call system_clock(start)
call mult(A, B, C)
call system_clock(finish)
write(6,*) 'Elapsed Time in seconds:',
& dble(finish - start) / dble(rate)
call delete(A)
call delete(B)
call delete(C)
end program time_blockdiagonal
The ifort_18 segmentation fault is:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
a.out 00000000004134BD Unknown Unknown Unknown
libpthread-2.26.s 00007FF720EB6300 Unknown Unknown Unknown
a.out 000000000040ABB8 Unknown Unknown Unknown
a.out 000000000040B029 Unknown Unknown Unknown
a.out 0000000000407113 Unknown Unknown Unknown
a.out 0000000000402B4E Unknown Unknown Unknown
libc-2.26.so 00007FF720B0AF8A __libc_start_main Unknown Unknown
a.out 0000000000402A6A Unknown Unknown Unknown
When I went into it with gdb, it turned out that the segfault is raised upon returning from the function mult_blocks.
blockdiagonal_matrices::mult_blocks (c=..., a=..., b=...) at ifort_bug.f:63
63 do i = 1, size(A)
(gdb) s
64 C(i)%block = matmul(A(i)%block, B(i)%block)
(gdb) s
s
s
65 end do
(gdb) s
64 C(i)%block = matmul(A(i)%block, B(i)%block)
(gdb) s
65 end do
(gdb) s
66 end function
(gdb) s
Program received signal SIGSEGV, Segmentation fault.
0x000000000040abc4 in do_deallocate_all ()
(gdb) q
A debugging session is active.
Even with this information I cannot find a bug in my code.
EDIT
I found a fix, although I don't understand why it works.
Compilation fix
If I compile with -heap-arrays it works, so at first glance it seems to run into a stack overflow. However, ulimit -s unlimited does not solve the problem.
Code fix
If I explicitly allocate in the code it solves the issue.
subroutine new(blocks, blocksizes)
type(t_blockdiagonal), allocatable, intent(out) :: blocks(:)
integer, intent(in) :: blocksizes(:)
integer :: i, L
allocate(blocks(size(blocksizes)))
do i = 1, size(blocks)
L = blocksizes(i)
allocate(blocks(i)%block(L, L))
end do
end subroutine
subroutine delete(blocks)
type(t_blockdiagonal), allocatable :: blocks(:)
integer :: i
do i = 1, size(blocks)
deallocate(blocks(i)%block)
end do
deallocate(blocks)
end subroutine
function mult_blocks(A, B) result(C)
type(t_blockdiagonal), intent(in) :: A(:), B(:)
type(t_blockdiagonal), allocatable :: C(:)
integer :: i
call new(C, blocksizes=blocksizes(A))
do i = 1, size(A)
C(i)%block = matmul(A(i)%block, B(i)%block)
end do
end function
This way of writing it is IMHO actually better than before and not a "dirty hack".
It is no longer possible to call new with blocksizes and blocks of differing size.
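For completeness, a short hedged usage sketch of the allocatable variant (names as in the module above; the program name is mine): the caller no longer dimensions blocks itself, new allocates it from blocksizes and delete deallocates it:
program use_allocatable_blocks
use blockdiagonal_matrices
implicit none
type(t_blockdiagonal), allocatable :: A(:), B(:), C(:)
integer :: i
call new(A, blocksizes=[100, 100]) ! A is allocated to size(blocksizes) inside new
call new(B, blocksizes=blocksizes(A))
do i = 1, size(A)
call random_number(A(i)%block)
call random_number(B(i)%block)
end do
C = A .mult. B ! C comes back already allocated by mult_blocks
call delete(A)
call delete(B)
call delete(C)
end program use_allocatable_blocks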
Open question
Speaking C, the type(t_blockdiagonal) :: blocks(n)
should be a float** blocks[n], i.e. just a vector of pointers to pointers.
The allocation of the actual blocks happened on the heap in the first version as well. Hence I don't see why I should get a stack overflow for a vector that contains ca. 10 pointers.
Intel® Fortran Compiler - Increased stack usage of 8.0 or higher compilers causes segmentation fault (via archive.org)
According to this article, ulimit -s unlimited does not mean that the stack will be literally "unlimited":
The size of "unlimited" varies by Linux configuration, so you may need to specify a larger, specific number to ulimit (for example, 999999999). On Linux also note that many 32bit Linux distributions ship with a pthread static library (libpthread.a) that at runtime will fix the stacksize to 2093056 bytes regardless of the ulimit setting.
It is most likely that you do indeed have a stack overflow, given that -heap-arrays solves the issue.

Thread Segmentation fault when calling function in loop with OpenMP

I'm trying to use OpenMP in Fortran 90 to parallelize a do loop with a function call inside it. The first code listed below runs fine; the second does not, and I receive a segmentation fault.
First program: $ gfortran -O3 -o output -fopenmp OMP10.f90
program OMP10
!$ use omp_lib
IMPLICIT NONE
integer, parameter :: n = 100000
integer :: i
real(kind = 8) :: sum,h,x(0:n),f(0:n),ZBQLU01
!$ call OMP_set_num_threads(4)
h = 2.d0/dble(n)
!$OMP PARALLEL DO PRIVATE(i)
do i = 0,n
x(i) = -1.d0+dble(i)*h
f(i) = 2.d0*x(i)
end do
!$OMP END PARALLEL DO
sum = 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:SUM)
do i = 0,n-1
sum = sum + h*f(i)
end do
!$OMP END PARALLEL DO
write(*,*) "The integral is ", sum
end program OMP10
Second program: $ gfortran -O3 -o output -fopenmp randgen.f OMP10.f90
program OMP10
!$ use omp_lib
IMPLICIT NONE
integer, parameter :: n = 100000
integer :: i
real(kind = 8) :: sum,h,x(0:n),f(0:n),ZBQLU01
!$ call OMP_set_num_threads(4)
h = 2.d0/dble(n)
!$OMP PARALLEL DO PRIVATE(i)
do i = 0,n
x(i) = ZBQLU01(0.d0)
end do
!$OMP END PARALLEL DO
sum = 0.d0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:SUM)
do i = 0,n-1
sum = sum + h*f(i)
end do
!$OMP END PARALLEL DO
write(*,*) "The integral is ", sum
end program OMP10
In the above command, randgen.f is a library that contains the function ZBQLU01.
You cannot just call any function from a parallel region. The function must be thread safe. See What is meant by "thread-safe" code? and https://en.wikipedia.org/wiki/Thread_safety .
Your function is quite the opposite of thread safe, as is typical for random number generators. Just notice the SAVE statements in the source code for many local variables and for a common block.
The solution is to use a good parallel random number generator. This site is not for software recommendations, but as a pointer, just search the web for "parallel prng" or "parallel random number generator". I personally use a library which I already pointed to in https://stackoverflow.com/a/38263032/721644 . A simple web search reveals another simple possibility at https://jblevins.org/log/openmp , and then there are many larger and more complex libraries.
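As an illustration only (this is not one of the libraries linked above): if plain uniform variates are all you need, one hedged possibility is the random_number intrinsic; recent gfortran implements it with per-thread generator state, but whether it is thread safe is compiler-dependent, so check your compiler's documentation before relying on this:
program omp_rng_sketch
!$ use omp_lib
implicit none
integer, parameter :: n = 100000
integer :: i
real(kind = 8) :: x(0:n)
! each thread draws its own variates; thread safety of random_number
! is compiler-dependent (recent gfortran keeps per-thread state)
!$omp parallel do private(i)
do i = 0,n
call random_number(x(i))
end do
!$omp end parallel do
write(*,*) "first and last sample: ", x(0), x(n)
end program omp_rng_sketch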

OpenMP: how to protect an array from race condition

This is a follow-up to questions 36182486, 41421437, and several others. I want to speed up the assembly of skewness and mass matrices for a FEM calculation by using multiple processors to deal with individual elements in parallel. This little MWE shows the guts of the operation.
!! compile with gfortran -fopenmp -o FEMassembly FEMassembly.f90
Program FEMassembly
use, intrinsic :: iso_c_binding
implicit none
real (c_double) :: arrayM(3,3)=reshape((/2.d0,1.d0,1.d0,1.d0,&
&2.d0,1.d0,1.d0,1.d0,2.d0/),(/3,3/)) ! contrib from one element
integer (c_int) :: ke,ne=4,kx,nx=6,nodes(3)
real (c_double) :: L(6,6)
integer (c_int) :: t(4,3)=reshape((/1,2,5,6,2,3,4,5,4,5,2,3/),(/4,3/))
!! first, no OMP
do ke=1,ne ! for each triangular element
nodes=t(ke,:)
L(nodes,nodes)=L(nodes,nodes)+arrayM
end do
print *,'L no OMP'
write(*,fmt="(6(1x,f3.0))")(L(kx,1:6),kx=1,nx)
L=0
!$omp parallel do private (nodes)
do ke=1,ne ! for each triangular element
nodes=t(ke,:)
!! !$omp atomic
L(nodes,nodes)=L(nodes,nodes)+arrayM
!! !$omp end atomic
end do
!$omp end parallel do
print *,'L with OMP and race'
write(*,fmt="(6(1x,f3.0))")(L(kx,1:6),kx=1,nx)
End Program FEMassembly
With the atomic directives commented out, the array L contains several wrong values, presumably because of the race condition I was trying to avoid with the atomic directives. The results are:
L no OMP
2. 1. 0. 1. 0. 0.
1. 6. 1. 2. 2. 0.
0. 1. 4. 0. 2. 1.
1. 2. 0. 4. 1. 0.
0. 2. 2. 1. 6. 1.
0. 0. 1. -0. 1. 2.
L with OMP and race
2. 1. 0. 1. 0. 0.
1. 6. 1. 2. 2. 0.
0. 1. 2. 0. 2. 1.
1. 2. 0. 4. 1. 0.
0. 2. 2. 1. 6. 1.
0. 0. 1. 0. 1. 2.
If the "atomic" directives are uncommented, the compiler return the error:
Error: !$OMP ATOMIC statement must set a scalar variable of intrinsic type at (1)
where (1) points to arrayM in the line L(nodes,nodes).....
What I am hoping to achieve is to have the time-consuming contributions from each element (here the trivial arrayM) computed in parallel, but since several threads address the same matrix elements, something has to be done to make the sums occur in an orderly fashion. Can anyone suggest a way to do this?
In Fortran the simplest way is to use a reduction, because OpenMP for Fortran supports reductions on arrays. Below is what I think you are trying to do, but take it with a pinch of salt because:
You don't provide the expected output, so it's difficult to test
With such a small array, race conditions can be difficult to trigger
!! compile with gfortran -fopenmp -o FEMassembly FEMassembly.f90
Program FEMassembly
use, intrinsic :: iso_c_binding
Use omp_lib, Only : omp_get_num_threads
implicit none
real (c_double) :: arrayM(3,3)=reshape((/2.d0,1.d0,1.d0,1.d0,&
&2.d0,1.d0,1.d0,1.d0,2.d0/),(/3,3/)) ! contrib from one element
integer (c_int) :: ke,ne=4,nodes(3)
real (c_double) :: L(6,6)
integer (c_int) :: t(4,3)=reshape((/1,2,5,6,2,3,4,5,4,5,2,3/),(/4,3/))
! Not declared in original program
Integer :: nx, kx
! Not set in original program
nx = Size( L, Dim = 1 )
!$omp parallel default( none ) private ( ke, nodes ) shared( ne, t, L, arrayM )
!$omp single
Write( *, * ) 'Working on ', omp_get_num_threads(), ' threads'
!$omp end single
!$omp do reduction( +:L )
do ke=1,ne ! for each triangular element
nodes=t(ke,:)
L(nodes,nodes)=L(nodes,nodes)+arrayM
end do
!$omp end do
!$omp end parallel
write(*,fmt="(6(1x,f3.0))")(L(kx,1:6),kx=1,nx)
End Program FEMassembly
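If a reduction on the whole array is not an option (for example when L is too large to give every thread a private copy), a hedged alternative is to update one element at a time with !$omp atomic, which is exactly the scalar update the compiler error in the question asks for; expect it to be slower than the reduction. A sketch based on the question's MWE (the new loop indices i and j are mine):
!! compile with gfortran -fopenmp -o FEMatomic FEMatomic.f90
Program FEMatomic
use, intrinsic :: iso_c_binding
implicit none
real (c_double) :: arrayM(3,3)=reshape((/2.d0,1.d0,1.d0,1.d0,&
&2.d0,1.d0,1.d0,1.d0,2.d0/),(/3,3/)) ! contrib from one element
integer (c_int) :: ke,ne=4,kx,nx=6,nodes(3),i,j
real (c_double) :: L(6,6)
integer (c_int) :: t(4,3)=reshape((/1,2,5,6,2,3,4,5,4,5,2,3/),(/4,3/))
L=0
!$omp parallel do private(ke,nodes,i,j)
do ke=1,ne ! for each triangular element
nodes=t(ke,:)
do j=1,3
do i=1,3
! atomic update of a single scalar array element
!$omp atomic
L(nodes(i),nodes(j))=L(nodes(i),nodes(j))+arrayM(i,j)
end do
end do
end do
!$omp end parallel do
print *,'L with OMP and atomic updates'
write(*,fmt="(6(1x,f3.0))")(L(kx,1:6),kx=1,nx)
End Program FEMatomic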

Open MP if OpenMP:...else

Problem:
I have some code that I and a few others have been writing; I took it and made it use MPI and OpenMP with great results (it helps that I am running it on a Blue Gene/Q).
One thing I am not a fan of is that now I cannot compile the code without the -openmp flag, because to get the speedup I needed I used reduction variables.
Example:
!$OMP parallel do schedule(DYNAMIC, 4) reduction(min:min_val)
....
min_val = some_expression(i)
....
!$OMP end parallel do
result = sqrt(min_val)
I am looking for something like:
!$OMP if OMP:
!$OMP min_val = some_expression(i)
!$OMP else:
if ( min_val .gt. some_expression(i) ) min_val = some_expression(i)
!$OMP end else
Anybody know of something like this? Notice that without -openmp the !$OMP lines are ignored and the code runs normally with the correct, er same, answer.
Thanks,
(Yes, it is Fortran code, but it's almost identical to C and C++.)
To your exact question:
!$ whatever_statement
will use that statement only when compiled with OpenMP.
Otherwise, in your specific case, can't you just use:
!$OMP parallel do schedule(DYNAMIC, 4) reduction(min:min_val)
....
min_val = min(min_val, some_expression(i))
....
!$OMP end parallel do
result = sqrt(min_val)
?
I use this with and without -openmp quite often.
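A complete hedged sketch of that idea (some_expression, the loop bounds, and the initial value are made up for illustration); with -openmp the reduction applies, and without it the directive lines are ignored and the same serial min logic runs:
program min_reduction_demo
implicit none
integer :: i
real(8) :: min_val, result
min_val = huge(min_val) ! start from the identity value of min
!$OMP parallel do schedule(DYNAMIC, 4) reduction(min:min_val)
do i = 1, 1000
min_val = min(min_val, some_expression(i))
end do
!$OMP end parallel do
result = sqrt(min_val)
write(*,*) 'result = ', result
contains
real(8) function some_expression(i) ! stand-in for the real expression
integer, intent(in) :: i
some_expression = dble(i) + 3.d0
end function some_expression
end program min_reduction_demo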
If you are willing to use pre-processed FORTRAN source file, you can always rely on the macro _OPENMP to be defined when using OpenMP. The simplest example is:
program pippo
#ifdef _OPENMP
print *, "OpenMP program"
#else
print *, "Non-OpenMP program"
#endif
end program pippo
Compiled with:
gfortran -fopenmp main.F90
the program will give the following output:
OpenMP program
If you are unwilling to use pre-processed source files, then you can set a variable using the Fortran conditional compilation sentinel:
program pippo
implicit none
logical :: use_openmp = .false.
!$ use_openmp = .true.
!$ print *, "OpenMP program"
if( .not. use_openmp) then
print *, "Non-OpenMP program"
end if
end program pippo

Loop in fortran with openmp and allocatable arrays

I would like to do this:
program main
implicit none
integer l
integer, allocatable, dimension(:) :: array
allocate(array(10))
array = 0
!$omp parallel do private(array)
do l = 1, 10
array(l) = l
enddo
!$omp end parallel do
print *, array
deallocate(array)
end
But I am running into error messages:
*** glibc detected *** ./a.out: munmap_chunk(): invalid pointer: 0x00007fff25d05a40 ***
This seems to be a bug in ifort according to some discussions on the Intel forums, but it should be resolved in the version I am using (11.1.073 - Linux). This is a massively downscaled version of my code! Unfortunately, I cannot use static arrays as a workaround.
If I put the print into the loop, I get other errors:
*** glibc detected *** ./a.out: double free or corruption (out): 0x00002b22a0c016f0 ***
I didn't get the errors you're getting, but you have an issue with privatizing array in your OpenMP call.
[mjswartz@666-lgn testfiles]$ vi array.f90
[mjswartz@666-lgn testfiles]$ ifort -o array array.f90 -openmp
[mjswartz@666-lgn testfiles]$ ./array
0 0 0 0 0 0
0 0 0 0
[mjswartz@666-lgn testfiles]$ vi array.f90
[mjswartz@666-lgn testfiles]$ ifort -o array array.f90 -openmp
[mjswartz@666-lgn testfiles]$ ./array
1 2 3 4 5 6
7 8 9 10
First run is with private array, second is without.
program main
implicit none
integer l
integer, allocatable, dimension(:) :: array
allocate(array(10))
!$omp parallel do
do l = 1, 10
array(l) = l
enddo
print*, array
deallocate(array)
end program main
I just ran your code with ifort and openmp and it spewed 0d0's. I had to manually quit the execution. What is your expected output? I'm not a big fan of unnecessarily dynamically allocating arrays. You know what you're going to allocate your matrices as, so just make parameters and statically do it. I'll mess with some stuff and edit this response in a few.
Ok, so here's my edits:
program main
implicit none
integer :: l, j
integer, parameter :: lmax = 15e3
integer, parameter :: jmax = 25
integer, parameter :: nk = 300
complex*16, dimension(9*nk) :: x0, xin, xout
complex*16, dimension(lmax) :: e_pump, e_probe
complex*16 :: e_pumphlp, e_probehlp
character*25 :: problemtype
real*8 :: m
! OpenMP variables
integer :: myid, nthreads, omp_get_num_threads, omp_get_thread_num
x0 = 0.0d0
problemtype = 'type1'
if (problemtype .ne. 'type1') then
write(*,*) 'Problem type not specified. Quitting'
stop
else
! Spawn a parallel region explicitly scoping all variables
!$omp parallel
myid = omp_get_thread_num()
if (myid .eq. 0) then
nthreads = omp_get_num_threads()
write(*,*) 'Starting program with', nthreads, 'threads'
endif
!$omp do private(j,l,m,e_pumphlp,e_probehlp,e_pump,e_probe)
do j = 1, jmax - 1
do l = 1, lmax
call electricfield(0.0d0, 0.0d0, e_pumphlp, &
e_probehlp, 0.0d0)
! print *, e_pumphlp, e_probehlp
e_pump(l) = e_pumphlp
e_probe(l) = e_probehlp
print *, e_pump(l), e_probe(l)
end do
end do
!$omp end parallel
end if
end program main
Notice I removed your use of a module since it was unnecessary. You have an external module containing a subroutine, so just make it an external subroutine. Also, I changed your matrices to be statically allocated. Case statements are a fancy and expensive version of if statements. You were casing 15e3*25 times rather than once (expensive), so I moved those outside. I changed the OpenMP calls, but only semantically. I gave you some output so that you know what OpenMP is actually doing.
Here is the new subroutine:
subroutine electricfield(t, tdelay, e_pump, e_probe, phase)
implicit none
real*8, intent(in) :: t, tdelay
complex*16, intent(out) :: e_pump, e_probe
real*8, optional, intent (in) :: phase
e_pump = 0.0d0
e_probe = 0.0d0
return
end subroutine electricfield
I just removed the module shell around it and changed some of your variable names. Fortran is not case sensitive, so don't torture yourself by doing caps and having to repeat it throughout.
I compiled this with
ifort -o diffeq diffeq.f90 electricfield.f90 -openmp
and ran with
./diffeq > output
to catch the program vomiting 0's and to see how many threads I was using:
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
Starting program with 32 threads
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
Hope this helps!
It would appear that you are running into a compiler bug associated with the implementation of OpenMP 3.0.
If you can't update your compiler, then you will need to change your approach. There are a few options - for example, you could make the allocatable arrays shared, increase their rank by one, and have one thread allocate them such that the extent of the additional dimension is the number of workers in the team. All subsequent references to those arrays then need to have the subscript for that additional rank be the OpenMP thread number (+ 1, depending on what you've used for the lower bound). A sketch of this option is given below.
Explicit allocation of the private allocatable arrays inside the parallel construct (only) may also be an option.
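A hedged sketch of the first option, built on the question's example (the program name and the use of omp_lib here are mine): the extra rank is indexed by the thread number, and without OpenMP the !$ sentinel lines are ignored so everything lands in column 1:
program per_thread_columns
!$ use omp_lib
implicit none
integer :: l, tid, nthreads
integer, allocatable, dimension(:,:) :: array ! extra rank: one column per thread
nthreads = 1
!$ nthreads = omp_get_max_threads()
allocate(array(10, nthreads))
array = 0
!$omp parallel private(tid, l) shared(array)
tid = 1
!$ tid = omp_get_thread_num() + 1
!$omp do
do l = 1, 10
array(l, tid) = l
end do
!$omp end do
!$omp end parallel
print *, sum(array, dim=2) ! combine the per-thread columns
deallocate(array)
end program per_thread_columns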