Fortran Modules with OpenMP

It seems to me that Fortran modules, which can be used to hold global variables across subroutines, don't behave the same way when using OpenMP. Here's an example:
main.f90
program main
  use mod
  implicit none
  !$OMP PARALLEL private(a)
  !$OMP DO
  do i=1,10
     a=i-1
     print*,"a =",a
     call echo
     print*,"b =",b
  enddo
  !$OMP END DO
  !$OMP END PARALLEL
end program main
echo.f90
subroutine echo
  use mod
  implicit none
  b=a+1
  !print *,a,"+1=",b
end subroutine echo
mod.f90
module mod
  integer :: i,a,b
end module mod
Now if you compile and run this without OpenMP you get:
a = 0
b = 1
a = 1
b = 2
a = 2
...etc. This is what you'd expect.
But if you compile WITH OpenMP you get:
a = 7
b = 1
a = 6
b = 1
a = 8
...etc. This is not what I want. I know that the echo subroutine is getting 'a' from the module, not the private 'a' that the thread has. Is there any way to fix this besides passing 'a' as an argument? There are a ton of variables in my module and that would be tedious.

Inside the procedure echo, a and b are variables that are referenced in a region but not in a construct - execution-wise they appear in between a matching !$OMP PARALLEL and !$OMP END PARALLEL directive, but source-wise they do not. As they are module variables, and in the absence of directives to the contrary, the rules for data sharing attributes in section 2.14.1.2 of the OpenMP 4.0 standard specify that those variables are shared inside the procedure.
Consequently your example code has a data race, with multiple threads writing to b inside the echo subroutine without synchronization.
You can use the THREADPRIVATE directive in the module to change the data sharing attribute of those module variables, as sketched below. You will need to remove the private(a) specification at the same time.
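A minimal sketch of that change, applied to the code above (untested):
module mod
  integer :: i, a, b
  !$omp threadprivate(a, b)   ! each thread gets its own copy of a and b
end module mod
and the parallel region in main loses the private(a) clause:
!$OMP PARALLEL
!$OMP DO
do i=1,10
   a = i-1        ! writes this thread's copy of a
   print*,"a =",a
   call echo      ! echo now sees the same thread's a and b
   print*,"b =",b
enddo
!$OMP END DO
!$OMP END PARALLEL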
In the long run, a far better approach may be to make the flows of information in your program explicit to a reader of the code (and more flexibly configurable by a code writer), by passing information as arguments (perhaps bundled together in derived types) rather than hiding those flows through the use of global (module) variables.
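For illustration, a minimal sketch of that style (the state type and its components here are made-up names, not part of the original code):
module mod
  implicit none
  type :: state
     integer :: a, b
  end type state
contains
  subroutine echo(s)
    type(state), intent(inout) :: s
    s%b = s%a + 1   ! the data flow is now explicit at the call site
  end subroutine echo
end module mod
Each thread then works on its own private variable of that type, and no global state is involved:
integer :: i
type(state) :: s
!$OMP PARALLEL DO private(s)
do i = 1, 10
   s%a = i - 1
   call echo(s)
   print*, "a =", s%a, " b =", s%b
enddo
!$OMP END PARALLEL DO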

Related

Declare all Fortran module variables target OpenMP 4.5+

I have a Fortran 90 code that uses chemical species properties (i.e. molecular weight, viscosity, etc.) for calculations.
To easily swap groups of chemical species in and out, we keep module files that store all the relevant data in 1D arrays. For example, if we have 4 species, the viscosity array is 4 elements long, one entry for each species, and so on.
The relevant subroutines that need this data can then use this module, and the chemical data is available as needed.
We have ported a majority of the code to GPU offloading with OpenMP 4.5 and are at the point of porting over these chemical calculations.
What I would like to do is just place the entire module onto the GPU, so that any subroutines that use these module variables have access to them on the target device.
My initial thought was to just !$omp declare target the module like we do functions or other subroutines, but that doesn't seem to be accepted by the compiler.
Do I really have to !$omp declare target(variable_x, y,z,a,b,c......) for the entire module?
And if I do that, what is the scope then of these variables? Are they accessible to everything on the device now even if a subroutine doesn't use the module? Or is the compiler smart enough to keep them within the module scope of the subroutine using them?
Lastly, is there anything special that needs to be done to a subroutine that uses these modules when I am just creating target regions within the subroutine? For example:
subroutine test
  use chem_module
  implicit none
  integer :: i
  !$omp parallel do
  do i=1,100
     ! do some calcs with module data
     ! do I need to tell the compiler about the chem_module module?
  end do
  !$omp end parallel do
end subroutine
Thanks for taking a look!
Turns out that's just part of the Fortran API... it requires a list, and you can't wrap the declarations in an encompassing
!$omp declare target
declare stuff...
!$omp end declare target
So yes, you need a massive list as far as I know.
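For example, a hedged sketch of the list form in a module (the array names here are made up, not from the original code):
module chem_module
  implicit none
  integer, parameter :: nspec = 4
  real :: molweight(nspec), viscosity(nspec)
  ! map these module variables to the device with an explicit list
  !$omp declare target(molweight, viscosity)
end module chem_module
Any device code that uses chem_module can then reference these arrays directly; their scope follows the normal Fortran use-association rules.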

OpenMP race condition (Fortran 77 w/ COMMON block)

I am trying to parallelise some legacy Fortran code with OpenMP.
Checking for race conditions with Intel Inspector, I have come across a problem in the following code (simplified, tested example):
      PROGRAM TEST
!$    use omp_lib
      implicit none
      DOUBLE PRECISION :: x,y,z
      COMMON /firstcomm/ x,y,z
!$OMP THREADPRIVATE(/firstcomm/)
      INTEGER :: i
!$    call omp_set_num_threads(3)
!$OMP PARALLEL DO
!$OMP+ COPYIN(/firstcomm/)
!$OMP+ PRIVATE(i)
      do i=1,3000
         z = 3.D0
         y = z+log10(z)
         x = y+z
      enddo
!$OMP END PARALLEL DO
      END PROGRAM TEST
Intel Inspector detects a race condition between the following lines:
!$OMP PARALLEL DO (read)
z = 3.D0 (write)
The Inspector "Disassembly" view offers the following about the two lines, respectively (I do not understand much about these, apart from the fact that the memory addresses in both lines seem to be different):
0x3286 callq 0x2a30 <memcpy>
0x3338 movq %r14, 0x10(%r12)
As in my main application, the problem occurs for one (/some) variable in the common block, but not for others that are treated in what appears to be the same way.
Can anyone spot my mistake, or is this race condition a false positive?
I am aware that the use of COMMON blocks, in general, is discouraged, but I am not able to change this for the current project.
Technically speaking, your example code is incorrect since you are using COPYIN to initialise threadprivate copies with data from an uninitialised COMMON block. But that is not the reason for the data race - adding a DATA statement, or simply assigning to x, y, and z before the parallel region, does not change the outcome.
This is either a (very old) bug in the Intel Fortran Compiler, or Intel is interpreting the text of the OpenMP standard (section 2.15.4.1 of the current version) strangely:
The copy is done, as if by assignment, after the team is formed and prior to the start of execution of the associated structured block.
Intel implements the final part of that sentence ("prior to the start of execution of the associated structured block") by inserting a memcpy at the beginning of the outlined procedure. In other words:
!$OMP PARALLEL DO COPYIN(/firstcomm/)
do i = 1, 3000
...
end do
!$OMP END PARALLEL DO
becomes (in a mixture of Fortran and pseudo-code):
par_region0:
   my_firstcomm = get_threadprivate_copy(/firstcomm/)
   if (my_firstcomm != firstcomm) then
      memcpy(my_firstcomm, firstcomm, size of firstcomm)
   end if
   // Actual implementation of the DO worksharing construct
   call determine_iterations(1, 3000, low_it, high_it)
   do i = low_it, high_it
      ...
      ... my_firstcomm used here instead of firstcomm
      ...
   end do
   call openmp_barrier
end par_region0

MAIN:
   // Prepare a parallel region with 3 threads
   // and fire the outlined code in the worker threads
   call start_parallel_region(3, par_region0)
   // Fire the outlined code in the master thread
   call par_region0
   call end_parallel_region
The outlined procedure first finds the address of the threadprivate copy of the common block, then compares that address to the address of the common block itself. If both addresses match, then the code is being executed in the master thread and no copy is needed, otherwise memcpy is called to make a bitwise copy of the master's data into the threadprivate block.
Now, one would expect a barrier at the end of the initialisation part, right before the start of the loop, and although Intel employees claim that there is one, there is none (tested with ifort 11.0, 14.0, and 16.0). What's more, the Intel Fortran Compiler does not honour the list of variables in the COPYIN clause and copies the entire common block if any variable contained in it is listed in the clause, i.e. COPYIN(x) is treated the same as COPYIN(/firstcomm/).
Whether those are bugs or features of Intel Fortran Compiler, only Intel could tell. It could also be that I'm misreading the assembly output. If anyone could find the missing barrier, please let me know. One possible workaround would be to split the combined directive and insert an explicit barrier before the worksharing construct:
!$OMP PARALLEL COPYIN(/firstcomm/) PRIVATE(I)
!$OMP BARRIER
!$OMP DO
do i = 1, 3000
   z = 3.D0
   y = z+log10(z)
   x = y+z
end do
!$OMP END DO
!$OMP END PARALLEL
With that change, the data race will shift into the initialisation of the internal dispatch table within the log10 call, which is probably a false positive.
GCC implements COPYIN differently. It creates a shared copy of the master thread's threadprivate data, which it then passes to the worker threads for use in the copy process.

programming issue with openmp

I am having issues with OpenMP, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the OpenMP code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
No issues with compiling; however, when I run the program, there are two major issues compared to the result of the serial code:
The program runs even slower than the serial code (which supposedly does matrix multiplications (matmul) in the do-loop).
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it).
Any ideas what might be going on?
Thanks,
Xiaoyu
In the case of parallelization using OpenMP, you need to specify the number of threads your program is to use. You can do so by using the environment variable OMP_NUM_THREADS, e.g. calling your program by means of
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime via omp_set_num_threads, as sketched below.
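A minimal sketch of the runtime call (assuming a program compiled with OpenMP enabled):
program set_threads
  use omp_lib
  implicit none
  call omp_set_num_threads(5)   ! request 5 threads for subsequent parallel regions
  !$omp parallel
  print *, 'hello from thread', omp_get_thread_num()
  !$omp end parallel
end program set_threads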
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
   prelimRes = myFunction(i)
   res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. use OpenBLAS), your results may indeed vary (variations should be smaller than 1e-8 with regard to double precision variables) due to the different order of floating-point operations under parallel processing.
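As a hedged illustration of why parallel results can differ at roundoff level: a parallel sum combines per-thread partial sums in an unspecified order, and floating-point addition is not associative, so the result may deviate slightly from the serial one.
program sum_order
  implicit none
  integer :: i
  double precision :: s
  s = 0.d0
  !$omp parallel do reduction(+:s)
  do i = 1, 1000000
     s = s + 1.d0/dble(i)
  end do
  !$omp end parallel do
  print *, s   ! may differ from the serial sum in the last digits
end program sum_order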
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. as many threads as there are CPUs, you can do so as follows (just like you stated in your question):
subroutine ...
  use omp_lib
  ...
  call omp_set_num_threads(omp_get_num_procs())
  !$omp parallel do
  do i=1,N
     ....
  end do
  !$omp end parallel do
end subroutine ...

Thread issues when writing to files with OpenMP in Fortran

The number of files that get written is always less than the number of threads. Logically, when I have 4 threads and the CPU is working at 400%, I would expect the number of files to be 4 (one corresponding to each thread). I don't know if there is a problem with my code or if this is how it is supposed to work. The code is as follows:
!!!!!!!! module
module common
  use iso_fortran_env
  implicit none
  integer, parameter :: dp = real64
  real(dp) :: aa, bb
contains
  subroutine evolve(y, yevl)
    implicit none
    integer(dp), parameter :: id = 2
    real(dp), intent(in) :: y(id)
    real(dp), intent(out) :: yevl(id)
    yevl(1) = y(2) + 1.d0 - aa*y(1)**2
    yevl(2) = bb*y(1)
  end subroutine evolve
end module common

use common
implicit none
integer(dp) :: iii, iter, i
integer(dp), parameter :: id = 2
real(dp), allocatable :: y(:), yt(:)
integer(dp) :: OMP_GET_THREAD_NUM, IXD
allocate(y(id)); allocate(yt(id)); y = 0.d0; yt = 0.d0; bb = 0.3d0
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD = OMP_GET_THREAD_NUM()
!$OMP DO
do iii = 1, 20000; print*, iii   !! EXPECTED THREADS TO BE OF 5000 ITERATIONS EACH
  aa = 1.d0 + dfloat(iii-1)*0.4d0/80000.d0
  loop1: do iter = 1, 10         !! THE INITIAL CONDITION LOOP
    call random_number(y)        !! RANDOM INITIALIZATION OF THE VARIABLE
    loop2: do i = 1, 70000       !! ITERATION OF THE SYSTEM
      call evolve(y, yt)
      y = yt
    enddo loop2                  !! END OF SYSTEM ITERATION
    write(IXD+1,*) aa, yt        !!! WRITING FILE CORRESPONDING TO EACH THREAD
  enddo loop1                    !! INITIAL CONDITION ITERATION DONE
enddo
!$OMP ENDDO
!$OMP END PARALLEL
end
Is this behavior resulting from some race issue in the code? The code compiles and executes just fine, without any warnings or errors, with ifort version 13.1.0 on Ubuntu. Thanks a bunch for any comments or suggestions.
The variable IXD should be explicitly declared as private to make sure every thread has its own copy of it. Changing the line(s)
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
to
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt,ixd) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
solves the problem.
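For reference, a minimal self-contained sketch of the one-file-per-thread pattern with ixd correctly private (the file names and unit numbers here are made up):
program per_thread_files
  use omp_lib
  implicit none
  integer :: ixd, unit
  character(len=20) :: fname
  !$omp parallel private(ixd, unit, fname)
  ixd = omp_get_thread_num()
  unit = 10 + ixd                                ! one distinct unit per thread
  write(fname,'(a,i0,a)') 'thread_', ixd, '.dat'
  open(unit, file=fname, status='replace')
  write(unit,*) 'written by thread', ixd
  close(unit)
  !$omp end parallel
end program per_thread_files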

OpenMP no threading in subroutine

I'm writing a matrix multiplication subroutine in Fortran. I'm using the Intel Fortran compiler. I've written a simple statically scheduled parallel do-loop. Unfortunately, it's running on only one thread. Here's the code:
      SUBROUTINE MATMULT(A,B,C,L,M,N)
      REAL*8 A,B,C
      INTEGER NCORES, CHUNK, TID
      DIMENSION A(L,N),B(L,M),C(M,N)
      PARAMETER (NCORES=8)
      CHUNK=(L/(NCORES+1))+1
      TID=0
!$OMP PARALLELDO SHARED(A,B,C,L,M,N,CHUNK) PRIVATE(I,J,K,TID)
!$OMP+DEFAULT(NONE) SCHEDULE(STATIC,CHUNK)
      DO I=1,L
        TID = OMP_GET_THREAD_NUM()
        PRINT *, "THREAD ", TID, " ON I=", I
        DO K=1,N
          DO J=1,M
            A(I,K) = A(I,K) + B(I,J)*C(J,K)
          END DO
        END DO
      END DO
!$OMP END PARALLELDO
      RETURN
      END
Note:
There are no parallel directives in the main program that calls the routine
The arrays A,B,C are initialized serially in the main program. A is initialized to zeros
I am enforcing the Fortran fixed source form during compilation
I have confirmed the following:
Another example program works fine with 8 threads (so no hardware issue)
I have used the -openmp compiler argument
OMP_GET_NUM_PROCS() and OMP_GET_MAX_THREADS() both return 0
TID is 0 for every iteration over I (which shouldn't be the case)
I am unable to diagnose my mistake. I'd appreciate any inputs on this.
The identifier OMP_GET_THREAD_NUM is not explicitly declared. The default implicit typing rules mean it will be of type real. That's not consistent with the declaration in the OpenMP spec for the function of that name.
Adding USE OMP_LIB would fix that issue, as sketched below. Further, not using implicit typing (IMPLICIT NONE) would avoid this and a multitude of similar problems.
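A minimal sketch of the routine with both fixes applied, in fixed source form as in the question (the CHUNK/SCHEDULE logic is omitted for brevity):
      SUBROUTINE MATMULT(A,B,C,L,M,N)
      USE OMP_LIB              ! supplies the correct INTEGER interface for OMP_GET_THREAD_NUM
      IMPLICIT NONE            ! catches any remaining implicitly typed identifiers
      INTEGER L,M,N,I,J,K,TID
      REAL*8 A(L,N),B(L,M),C(M,N)
!$OMP PARALLEL DO SHARED(A,B,C,L,M,N) PRIVATE(I,J,K,TID)
      DO I=1,L
        TID = OMP_GET_THREAD_NUM()
        PRINT *, "THREAD ", TID, " ON I=", I
        DO K=1,N
          DO J=1,M
            A(I,K) = A(I,K) + B(I,J)*C(J,K)
          END DO
        END DO
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE MATMULT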