Declare all Fortran module variables target OpenMP 4.5+ - fortran

I have a Fortran 90 code that uses chemical species properties (e.g. molecular weight, viscosity, etc.) for calculations.
To easily swap groups of chemical species in and out, we keep module files that store all the relevant data in 1D arrays. For example, if we have 4 species, the viscosity array is 4 elements long, with one entry for each species, and so on.
The relevant subroutines that need this data can then use this module, and the chemical data is available as needed.
We have ported a majority of the code to GPU offloading with OpenMP 4.5 and are at the point of porting over these chemical calculations.
What I would like to do is just place the entire module onto the GPU, so that any subroutines that use these module variables have access to them on the target device.
My initial thought was to just !$omp declare target the module like we do functions or other subroutines, but that doesn't seem to be accepted by the compiler.
Do I really have to !$omp declare target(variable_x, y,z,a,b,c......) for the entire module?
And if I do that, what is the scope then of these variables? Are they accessible to everything on the device now even if a subroutine doesn't use the module? Or is the compiler smart enough to keep them within the module scope of the subroutine using them?
Lastly, is there anything special that needs to be done to a subroutine that uses these modules when I am just creating target regions within the subroutine? For example:
subroutine test
use chem_module
implicit none
integer :: i
!$omp parallel do
do i=1,100
! do some calcs with module data
! do I need to tell the compiler about the chem_module module?
end do
!$omp end parallel do
end subroutine
Thanks for taking a look!

Turns out that's just part of the Fortran API: it requires a list, and you can't do an encompassing
!$omp declare target
declare stuff...
!$omp end declare target
So yes, you need a massive list as far as I know.
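As a rough illustration of the list form, here is a minimal sketch. The module contents (nspec, mol_weight, viscosity) and the numeric values are made up for the example and are not from the original code. The declare target directive only asks for device copies of the listed variables; Fortran's usual use-association rules still decide which subroutines can name them. Inside a target region nothing extra should be needed for these variables, but values set on the host have to be pushed to the device, for instance with target update.

```fortran
module chem_module
  implicit none
  integer, parameter :: nspec = 4          ! hypothetical species count
  real(8) :: mol_weight(nspec)             ! hypothetical property arrays
  real(8) :: viscosity(nspec)
  ! Every variable needed on the device has to be named in the list.
  !$omp declare target(mol_weight, viscosity)
end module chem_module

program demo
  use chem_module
  implicit none
  integer :: i
  real(8) :: s(nspec)

  ! Fill the host copies (illustrative values only).
  mol_weight = [2.016d0, 31.998d0, 28.014d0, 18.015d0]
  viscosity  = 1.0d-5

  ! Push the host values to the device copies created by declare target.
  !$omp target update to(mol_weight, viscosity)

  ! The module arrays are used directly; only the local result is mapped.
  !$omp target teams distribute parallel do map(from: s)
  do i = 1, nspec
     s(i) = mol_weight(i) * viscosity(i)
  end do
  !$omp end target teams distribute parallel do

  print *, s
end program demo
```

Compiled without OpenMP, the directives are plain comments and the loop runs on the host, which is a convenient way to check the logic before offloading.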

Related

Memory management in the modular structure of a Fortran 95 program with heavy computations and variables

I am currently "optimizing" a scientific modelling program developed in Fortran 95. The program performs heavy 3D computations to solve some equations. In addition, numerous variables have to be saved and reused: about 50 tables with sizes like (50; 50; 10000), and even some 5D tables with sizes like (6;6;15;15;10000), which are stored in order to reduce the computation time.
I developed a perfectly working version of this code using a python3 interface to control my runs. Basically python is calling a fortran module containing my code to obtain all the results from my modelling. The problem with this method is that I cannot parallelize my code in some time consuming regions. Moreover, I would benefit from the computational time advantage of Fortran for a post treatment of the models that is now partially done in python due to interface.
In the first part of my optimization campaign I want to add a way to control the runs from Fortran. A program would call the module containing my code to obtain all the necessary heavy variables. The Python interface would still be present; the switch between Fortran and Python control of the run is made at compile time in the Makefile. This Makefile is already done, everything compiles, and the Python interface still works perfectly.
My trouble concerns the Fortran control program and, I assume, its management of allocated memory. As the sizes of my tables are not known in advance and require opening some files, I have to declare all my variables as ALLOCATABLE. I then allocate them with the correct sizes before calling the module containing my code. When my code is called, memory-related errors appear, with the message "Program received signal SIGSEGV: Segmentation fault - invalid memory reference". The error appears when I set a table to 0d0. If I reduce the size/precision of my model, the program proceeds a bit further before crashing, hence my suspicion of a memory problem. I think I am doing something wrong in how variables are shared between my control program and my modelling module; maybe some variables are stored in the wrong memory space. I am using gfortran on Ubuntu 22.04.1.
I could try to solve this issue using derived types and pointers, or simply by breaking up my modelling module. Before going into these heavy structural modifications, I wanted to know whether someone has experienced an equivalent problem and what the solutions were.
Here is a schema of the structure of my code:
Run program:
program run_model
use coordinates
use file
use mathematical
use modelling_module
implicit none
integer :: n_x, n_y, n_z
real(8),dimension(:), ALLOCATABLE:: x,y,z
! + all other output variables in 3D
! ...
! Some operations and file opening
ALLOCATE(x(n_x),y(n_y),z(n_z))
! + all other variables
CALL modelling(n_x, n_y, n_z, output variables)
end program run_model
Modelling module in a separated file:
module modelling_module
use coordinates
use file
use mathematical
implicit none
private
public :: modelling
contains
subroutine modelling(n_x, n_y, n_z, output variables)
integer, intent(in):: n_x, n_y, n_z
real(8),dimension(n_x), intent(out):: x
real(8),dimension(n_y), intent(out):: y
real(8),dimension(n_z), intent(out):: z
! + all output variables
! Computation of the model
! ...
end subroutine modelling
end module modelling_module
Thank you in advance for your answers !

Program is not any faster with OpenMP

My goal is to parallelize a section in my Fortran program. The flow of the program is:
Read data from a file
make some computations
write the results to 2 different files
Here I want to parallelize the writing process since I’m writing into different files.
module foo
use omp_lib
implicit none
type element
integer, dimension(:), allocatable :: v1, v2
real(kind=8), dimension(:,:), allocatable :: M
end type element
contains
subroutine test()
implicit none
type(element) :: e
do
e = read_data_from_file()
call compute_data(e)
!$OMP SECTIONS
!$OMP SECTION
!$ call write_to_file1(e)
!$OMP SECTION
!$ call write_to_file2(e)
!$OMP END SECTIONS
end do
end subroutine test
...
end module foo
But this program isn't running any faster. So I think that I'm missing something?
In general one can divide scientific computing codes into bandwidth-bound and compute-bound algorithms. Bandwidth-bound algorithms are those that perform only a few operations on the data they need, such as having O(n) data on which O(n) flops are performed. Given hard disk or network connection speeds, I/O is a bandwidth-bound operation as well, and therefore parallelizes poorly or not at all.
If you really want to gain performance from parallelization, split the code into bandwidth-bound and compute-bound parts and spend your time parallelizing the latter.
If you specify your problem more precisely, there are hundreds of experts eager to solve it. From the comment on the answer above I see that you are using binary output but still have bandwidth left to write faster. That means your disk speed is fine and you're not limited by parsing; rather, your actual program is not putting out data at a faster pace than this.
So optimize your code to make it catch up with your write speed, instead of increasing the write speed of equally slow code.
Writing the 2 files sequentially at the maximum of your bandwidth is as fast as, and much easier than, writing them in parallel (at the same maximum speed).
If I am mistaken, and you are indeed limited by IO, maybe this other question/answer can help you: How to avoid programs in status D.
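One way to find out which side is the bottleneck, as the answers above suggest, is to time the compute phase and the write phase separately. The following is only a sketch under assumed conditions: the array size, the sin-based workload, and the scratch file are stand-ins, not the asker's read_data_from_file, compute_data, or write_to_file routines.

```fortran
program time_phases
  implicit none
  integer, parameter :: n = 2000000        ! assumed problem size
  real(8), allocatable :: v(:)
  integer(8) :: t0, t1, t2, rate
  integer :: i, u

  allocate(v(n))

  ! Compute phase: stand-in for the real computation.
  call system_clock(t0, rate)
  do i = 1, n
     v(i) = sin(real(i, 8))**2
  end do
  call system_clock(t1)

  ! Write phase: unformatted (binary) output to a scratch file.
  open(newunit=u, status='scratch', form='unformatted', access='stream')
  write(u) v
  close(u)
  call system_clock(t2)

  print '(a,f8.3,a)', 'compute: ', real(t1 - t0, 8) / real(rate, 8), ' s'
  print '(a,f8.3,a)', 'write:   ', real(t2 - t1, 8) / real(rate, 8), ' s'
end program time_phases
```

If the compute time dominates, parallelizing the writes is unlikely to help; if the write time dominates, the program is I/O-bound and threads will mostly wait on the disk.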

Fortran Modules with OpenMP

It seems to me that fortran modules, which can be used to hold global variables across subroutines, don't work the same when using OpenMP. Here's an example:
main.f90
program main
use mod
implicit none
!$OMP PARALLEL private(a)
!$OMP DO
do i=1,10
a=i-1
print*,"a =",a
call echo
print*,"b =",b
enddo
!$OMP END DO
!$OMP END PARALLEL
end program main
echo.f90
subroutine echo
use mod
implicit none
b=a+1
!print *,a,"+1=",b
end subroutine echo
mod.f90
module mod
integer:: i,a,b
end module mod
Now if you compile and run this without OpenMP you get:
a = 0
b = 1
a = 1
b = 2
a = 2
...etc. This is what you'd expect.
But, if you compile WITH openMP you get:
a = 7
b = 1
a = 6
b = 1
a = 8
...etc. This is not what I want. I know that the echo subroutine is getting 'a' from the module, not the private 'a' that the thread has. Is there any way to do this besides passing it as an argument? There are a ton of variables in my module and it would be tedious.
Inside the procedure echo, a and b are variables that are referenced in a region but not in a construct: execution-wise they appear in between a matching !$OMP PARALLEL and !$OMP END PARALLEL directive, but source-wise they do not. As they are module variables, and in the absence of directives to the contrary, the rules for data-sharing attributes in 2.14.1.2 of the OpenMP 4.0 standard specify that those variables inside the procedure are shared.
Consequently your example code has a data race, with multiple threads writing to b inside the echo subroutine without synchronization.
You can use the THREADPRIVATE directive in the module to change the data sharing attribute of those module variables. You will need to remove the private specification for a at the same time.
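A minimal sketch of that change, applied to the example from the question: the only modifications are the threadprivate directive in the module and the removed private(a) clause. Both prints are merged into one line here purely for readability.

```fortran
module mod
  implicit none
  integer :: i, a, b
  ! Each thread gets its own copy of a and b, also inside echo.
  !$omp threadprivate(a, b)
end module mod

subroutine echo
  use mod
  implicit none
  b = a + 1          ! reads and writes the calling thread's copies
end subroutine echo

program main
  use mod
  implicit none
  !$omp parallel
  !$omp do
  do i = 1, 10
     a = i - 1
     call echo
     print *, "a =", a, " b =", b
  end do
  !$omp end do
  !$omp end parallel
end program main
```

The iterations may still print in any order, but within each line b should now equal a + 1, because no thread touches another thread's copy.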
In the long run, a far better approach may be to make the flows of information in your program explicit to a reader of the code (and more flexibly configurable by a code writer), by passing information as arguments (perhaps bundled together in derived types) rather than hiding those flows through the use of global (module) variables.
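The argument-passing alternative might look like the sketch below. The names state_mod and state_t are invented for illustration; the idea is simply that each thread owns a private instance of the bundle and hands it to the procedure explicitly.

```fortran
module state_mod
  implicit none
  ! Hypothetical bundle of what used to be module-level globals.
  type :: state_t
     integer :: a = 0
     integer :: b = 0
  end type state_t
contains
  subroutine echo(s)
    type(state_t), intent(inout) :: s
    s%b = s%a + 1      ! works on whatever instance the caller passes in
  end subroutine echo
end module state_mod

program main
  use state_mod
  implicit none
  integer :: i
  type(state_t) :: s

  ! Each thread gets its own s; the data flow is visible at the call site.
  !$omp parallel do private(s)
  do i = 1, 10
     s%a = i - 1
     call echo(s)
     print *, "a =", s%a, " b =", s%b
  end do
  !$omp end parallel do
end program main
```

With many variables, grouping them in one or a few derived types keeps the argument lists short while avoiding hidden shared state.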

programming issue with openmp

I am having issues with openmp, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the openmp code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
No issues with compiling, however when I run the program, there are two major issues compared to the result of serial code:
The program is running even slower than the serial code (which is supposed to do matrix multiplications (matmul) in the do-loop)
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it)
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you will need to specify the number of threads your program is to use. You can do so with the environment variable OMP_NUM_THREADS, e.g. by calling your program as
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime with omp_set_num_threads.
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
prelimRes = myFunction(i)
res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. use OpenBLAS), your results may indeed vary (variations should be smaller than 1e-8 for double precision variables) due to the different order of operations in parallel processing.
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
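The note above about results varying under parallel execution comes down to floating-point addition not being associative: when threads or a threaded library combine partial results in a different order, the rounding differs. The sketch below shows the effect in isolation using single precision and deliberately extreme values chosen for the example; it is not a reproduction of the asker's matmul case.

```fortran
program sum_order
  use iso_fortran_env, only: real32
  implicit none
  real(real32) :: a, b, c

  a =  1.0e8_real32
  b = -1.0e8_real32
  c =  1.0_real32

  ! Same three numbers, two groupings. In single precision, b + c
  ! rounds back to b, so the second grouping loses the contribution of c.
  print *, "(a + b) + c =", (a + b) + c
  print *, "a + (b + c) =", a + (b + c)
end program sum_order
```

Differences of this kind are normally tiny for well-scaled data, so a large accuracy drop is more likely to point to a data race (for example a missing private clause) than to summation order.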
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. use as many threads as there are CPUs, you can do so by using (just like you stated in your question):
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...

Creating a subroutine that accepts different kinds of reals

I want to implement a subroutine that can work with reals in single precision, double precision and extended precision. The only solution I can come up with is shown in the code below. This solution works but I have to duplicate the code 3 times. Can this code duplication be avoided?
module mymodule
....
! some code here
interface my_func
module procedure my_func_sp
module procedure my_func_dp
module procedure my_func_ep
end interface
contains
subroutine my_func_sp(x,y)
real(kind=sp), dimension(:) :: x,y
... LONG IMPLEMENTATION HERE ...
end subroutine
subroutine my_func_dp(x,y)
real(kind=dp), dimension(:) :: x,y
... LONG IMPLEMENTATION HERE THAT IS EXACTLY THE SAME AS ABOVE ...
end subroutine
subroutine my_func_ep(x,y)
real(kind=ep), dimension(:) :: x,y
... LONG IMPLEMENTATION HERE THAT IS EXACTLY THE SAME AS THE TWO ABOVE ...
end subroutine
end module
Can this code duplication be avoided? Not really, this is the way Fortran works. You could:
Write the code once, for the highest-precision kind you care about, and have the other subroutines call that variant, casting the kinds of variables on the way in and out.
Another approach I have seen regularly is to write the computational statements in a file and to include that file in each of the subroutines. Just take care that the included statements are valid for all kinds of the type. Take care too that the same statements work across kinds. If, for example, your included lines include comparisons with a tolerance, as many numeric codes do, you may have to take special care that the tolerance is adjusted wrt the kind.
If your entire code will use single, double, or quadruple precision reals, you can define a parameter real_kind in a module and use that parameter to specify kinds throughout your code, including the declarations of real variables in your subroutine. This solution does not work if your code calls more than one of my_func_sp, my_func_dp, and my_func_ep in a single run.
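The first option in the list above (implement once at the highest precision and let the other variants convert on the way in and out) could be sketched as follows. The kind names come from iso_fortran_env, the intent attributes are added for the example, and the one-line body is only a placeholder for the long implementation; an extended-precision variant is left out because its kind is compiler dependent.

```fortran
module mymodule
  use iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  private
  public :: my_func

  interface my_func
     module procedure my_func_sp
     module procedure my_func_dp
  end interface
contains
  ! The single "real" implementation, written for the widest kind used here.
  subroutine my_func_dp(x, y)
    real(kind=dp), dimension(:), intent(in)  :: x
    real(kind=dp), dimension(:), intent(out) :: y
    y = 2.0_dp * x + 1.0_dp     ! placeholder for the long implementation
  end subroutine my_func_dp

  ! Thin wrapper: convert up, call the dp version, convert back down.
  subroutine my_func_sp(x, y)
    real(kind=sp), dimension(:), intent(in)  :: x
    real(kind=sp), dimension(:), intent(out) :: y
    real(kind=dp) :: yd(size(y))
    call my_func_dp(real(x, dp), yd)
    y = real(yd, sp)
  end subroutine my_func_sp
end module mymodule

program demo
  use iso_fortran_env, only: real32, real64
  use mymodule
  implicit none
  real(real32) :: xs(3), ys(3)
  real(real64) :: xd(3), yd(3)

  xs = [1.0, 2.0, 3.0]
  xd = [1.0d0, 2.0d0, 3.0d0]
  call my_func(xs, ys)          ! resolves to my_func_sp
  call my_func(xd, yd)          ! resolves to my_func_dp
  print *, ys
  print *, yd
end program demo
```

The trade-off is extra copies and conversions for the lower-precision callers, and the single-precision path no longer runs at single-precision speed.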