Benchmarking a Fortran loop

I need to benchmark a part of a Fortran program to understand and quantify the impact of specific changes (to make the code more maintainable we'd like to make it more object-oriented, taking advantage of function pointers for example).
I have a loop that calls the same subroutines several times to perform computations on finite elements. I want to see the impact of using function pointers instead of hard-coded calls.
do i = 1, n_of_finite_elements
    ! Need to benchmark execution time of this code
end do
What would be a simple way to get the execution time of such a loop and format it in a nice way?

I have a github project that measures the performance of passing various arrays at https://github.com/jlokimlin/array_passing_performance.git
It uses the CpuTimer derived data type from https://github.com/jlokimlin/cpu_timer.git.
Usage:
use, intrinsic :: iso_fortran_env, only: &
    wp => REAL64, &
    ip => INT32

use type_CpuTimer, only: CpuTimer

type (CpuTimer) :: timer
real (wp)       :: wall_clock_time
real (wp)       :: total_processor_time
integer (ip)    :: units = 0 ! optional argument: 0 for seconds, 1 for minutes, 2 for hours

! Start the timer
call timer%start()

do i = 1, n
    ! ...some big calculation...
end do

! Stop the timer
call timer%stop()

! Read off the times
wall_clock_time = timer%get_elapsed_time(units)
total_processor_time = timer%get_total_cpu_time(units)

! Write a time stamp to standard output
call timer%print_time_stamp()

! Write compiler info to standard output
call timer%print_compiler_info()
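If you would rather not pull in a dependency, here is a minimal, self-contained sketch using only the intrinsics system_clock and cpu_time (the loop body and n are placeholders standing in for your finite-element computation):

program bench_loop
    use, intrinsic :: iso_fortran_env, only: dp => real64, int64
    implicit none
    integer, parameter :: n = 1000000   ! placeholder problem size
    integer(int64) :: count_start, count_end, count_rate
    real(dp) :: cpu_start, cpu_end, acc
    integer :: i

    acc = 0.0_dp
    call cpu_time(cpu_start)
    call system_clock(count_start, count_rate)

    do i = 1, n
        acc = acc + sqrt(real(i, dp))   ! stand-in for the real work
    end do

    call system_clock(count_end)
    call cpu_time(cpu_end)

    write(*,'(a,f12.6,a)') 'Wall-clock time: ', &
        real(count_end - count_start, dp)/real(count_rate, dp), ' s'
    write(*,'(a,f12.6,a)') 'CPU time:        ', cpu_end - cpu_start, ' s'
    print *, acc   ! use the result so the compiler cannot drop the loop
end program bench_loop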


Random permutation and random seed

I am employing the Knuth algorithm to generate a random permutation
of an n-tuple. The code is below. For fixed n, it generates random permutations and collects the distinct ones until it has found all n! permutations. At the end it also prints the number of trials needed to find all the permutations. I have also inserted initialization of the seed from the time (in a very simple and naive way, though). There are two options (A and B). A: the seed is set once and for all in the main program. B: the seed is set every time a random permutation is computed (in the code below, the second option is commented out).
implicit none
integer :: n, ncomb
integer :: i, h, k, x
integer, allocatable :: list(:), collect(:,:)
logical :: found
integer :: trials
!
! A
!
integer :: z, values(1:8)
integer, dimension(:), allocatable :: seed

call date_and_time(values=values)
call random_seed(size=z)
allocate(seed(1:z))
seed(:) = values(8)
call random_seed(put=seed)

n = 4
ncomb = product((/(i, i=1,n)/))
allocate(list(n))
allocate(collect(n,ncomb))
trials = 0
h = 0
do
   trials = trials + 1
   list = Shuffle(n)
   found = .false.
   do k = 1, h
      x = sum(abs(list - collect(:,k)))
      if ( x == 0 ) then
         found = .true.
         exit
      end if
   end do
   if ( .not. found ) then
      h = h + 1
      collect(:,h) = list
      print *, h, ')', collect(:,h)
   end if
   if ( h == ncomb ) exit
end do
write(*,*) "Trials= ", trials

contains

   function Shuffle(n) result(list)
      integer, allocatable :: list(:)
      integer, intent(in) :: n
      integer :: i, randpos, temp, h
      real :: r
      !
      ! B
      !
      ! integer :: z, values(1:8)
      ! integer, dimension(:), allocatable :: seed
      ! call date_and_time(values=values)
      ! call random_seed(size=z)
      ! allocate(seed(1:z))
      ! seed(:) = values(8)
      ! call random_seed(put=seed)
      allocate(list(n))
      list = (/ (h, h=1,n) /)
      do i = n, 2, -1
         call random_number(r)
         randpos = int(r * i) + 1
         temp = list(randpos)
         list(randpos) = list(i)
         list(i) = temp
      end do
   end function Shuffle

end
You can check that the second option is not good at all. For n=4 it takes around 100 times more trials to obtain all the permutations, and for n=5 it gets stuck.
My questions are:
Why does calling random_seed multiple times give wrong results? What kind of systematic error am I introducing? Isn't it equivalent to calling random_seed only once but launching the code several times (each time generating only one random permutation)?
If I want to launch the code several times, computing a single permutation per run, I guess that initializing the random seed gives me the same problem (regardless of where the initialization happens, since now I am computing only one permutation). Correct? In that case, what do I have to do to initialize the seed without spoiling the uniform sampling? If I do not initialize the seed in a random way, I obtain the same permutation on every run. I guess I could print and read the seed every time I launch the code, so as not to restart from the same pseudo-random numbers; however, that is complicated to do if I launch several instances of the code in parallel.
UPDATE
I have understood the reply. In conclusion, if I want a different pseudorandom sequence on each run by initializing the seed, what I can do is:
A) Old gfortran
Use the subroutine init_random_seed() here
https://gcc.gnu.org/onlinedocs/gcc-5.1.0/gfortran/RANDOM_005fSEED.html
B) Most recent gfortran versions
call random_seed()
C) Fortran2018
call random_init(repeatable, image_distinct)
Questions
In the C) case, should I set repeatable=.false. and image_distinct=.true. to get a different random sequence each time?
What could be an efficient way to write the code portably, so that it works whatever the compiler is? (I mean, the code recognizes what is available and acts accordingly.)
You certainly should not ever call random_seed() repeatedly. It is supposed to be called just once. You make the problem worse by setting it in such a crude way.
Yes, one does often use the date and time to initialize it, but one must pre-process the time data through something that adds some entropy, like a very simple random generator. A good example can be found in the documentation of RANDOM_SEED for older versions of gfortran: https://gcc.gnu.org/onlinedocs/gcc-5.1.0/gfortran/RANDOM_005fSEED.html See how lcg() is used there to transform the date_and_time() data.
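For reference, a condensed sketch in the spirit of that example: pack date_and_time() into one 64-bit integer and run it through lcg() once per seed element, so that launches close together in time still get well-separated seeds. The constants are the ones from the linked documentation; the full example there additionally draws on the OS random source and the PID, so prefer it for real use.

subroutine init_random_seed()
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, allocatable :: seed(:)
    integer :: i, n, values(8)
    integer(int64) :: t

    call random_seed(size=n)
    allocate(seed(n))
    call date_and_time(values=values)
    ! Pack the date and time into a single 64-bit integer
    t = (values(1) - 1970) * 365_int64 * 24 * 60 * 60 * 1000 &
        + values(2) * 31_int64 * 24 * 60 * 60 * 1000 &
        + values(3) * 24_int64 * 60 * 60 * 1000 &
        + values(5) * 60 * 60 * 1000 &
        + values(6) * 60 * 1000 + values(7) * 1000 + values(8)
    ! One lcg() step per seed element spreads the entropy of t
    do i = 1, n
        seed(i) = lcg(t)
    end do
    call random_seed(put=seed)
contains
    function lcg(s) result(v)
        integer(int64), intent(inout) :: s
        integer :: v
        if (s == 0) then
            s = 104729
        else
            s = mod(s, 4294967296_int64)
        end if
        s = mod(s * 279470273_int64, 4294967291_int64)
        v = int(mod(s, int(huge(0), int64)), kind(0))
    end function lcg
end subroutine init_random_seed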
Note that more recent versions of gfortran will generate a random seed that is different every time just by calling random_seed() without any arguments. Older versions returned the same seed every time.
Also note that Fortran 2018 has random_init() where you can specify repeatable= to be true or false. With false you get a different sequence every time.
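For example, with a Fortran 2018 compiler, a different, non-repeatable sequence on every run (and on every image) is requested with:

call random_init(repeatable=.false., image_distinct=.true.)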
The portable thing is to use standard Fortran, that's all. But you cannot use new features and old compiler versions at the same time. This kind of portability does not exist. With old compilers you can only use old standard features. I won't even start writing about autoconf and stuff, it is not worth it.
So,
you can set your random number seed to be the same every time or distinct every time (see above),
and
you should always call random_seed or random_init only once.
Why does calling random_seed multiple times give wrong results?
You are re-starting the pseudorandom sequence at some unspecified state, probably with insufficient entropy, and quite possibly very close to the previous starting state.
What kind of systematic error I am introducing? Isn't it equivalent to calling random seed only once but launching the code several times (each time generating only one random permutation)?
It might be similar. But your seeding from the time is way too naive, and when running in the loop, the date and time are way too similar, if not completely equal, in most bits. A transform like the one linked above might mask that problem, but putting the date and time in directly as your seed is just not going to work.

gfortran how do I increment random_seed by 2^128

The gfortran page on random_seed says that when using OMP threads, each thread increments its seed by 2^128. I am wondering how I can increment the seed by 2^128 manually. I wrote a little test program that sets the master seed to all zeros and then inspects the per-thread seeds, but I don't understand what I'm seeing. What I'd like to know is, for example, what to put in the subroutine increment_by_2_tothe_128 below.
program main
   implicit none
   character(len=32) :: arg
   integer :: n
   integer :: i
   integer :: nthreads
   integer, allocatable :: seed(:, :)
   integer, allocatable :: master_seed(:)
   real, allocatable :: rn(:)

   call get_command_argument(1, arg)
   read(arg, *) nthreads

   call random_seed(size=n)
   allocate(seed(n, nthreads))
   allocate(master_seed(n))
   allocate(rn(nthreads))
   master_seed = 0
   seed = 0
   call random_seed(put=master_seed)
   ! call increment_by_2_tothe_128(n)

   call omp_set_num_threads(nthreads)
   !$OMP PARALLEL DO
   do i = 1, nthreads
      call random_number(rn(i))
      call random_seed(get=seed(:,i))
   end do

   do i = 1, nthreads
      print *, i
      print *, rn(i)
      print *, seed(:,i)
   end do
end program main

subroutine increment_by_2_tothe_128(n)
   implicit none
   integer, intent(in) :: n
   integer :: current_seed(n)
   integer :: increment_seed(n)
   call random_seed(get=current_seed)
   ! what goes here:
   ! increment_seed = current_seed + 2**128
   call random_seed(put=increment_seed)
end subroutine increment_by_2_tothe_128
You cannot do that manually. You would need access to the internals of the random number generator, but those are not exposed to Fortran programmers. And you obviously cannot call the generator 2^128 times.
If you need the shift, you have to use a pseudo-random number generator that does expose its internals and at the same time allows this kind of jump. That can be, for example, the xoroshiro PRNG family that is used internally by gfortran. These generators have a specialized function for this shift:
All generators, being based on linear recurrences, provide jump functions that make it possible to simulate any number of calls to the next-state function in constant time, once a suitable jump polynomial has been computed. We provide ready-made jump functions for a number of calls equal to the square root of the period, to make it easy generating non-overlapping sequences for parallel computations, and equal to the cube of the fourth root of the period, to make it possible to generate independent sequences on different parallel processors.
These generators are most often implemented in C, but Fortran implementations also exist (subroutine rng_jump is the jump function; disclaimer: the link goes to my repository, no guarantees of quality).
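The actual xoroshiro jump uses a precomputed jump polynomial, but the underlying idea, composing k steps of the generator into a single transformation built in O(log k) work, is easy to demonstrate with a plain LCG. A self-contained sketch (the constants are the classic ANSI C rand() pair with modulus 2^31; this only illustrates the jump idea, it is not gfortran's generator):

program lcg_jump_demo
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer(int64), parameter :: a = 1103515245_int64, c = 12345_int64
    integer(int64), parameter :: mask = 2147483647_int64   ! modulus 2^31
    integer(int64) :: x, y
    integer :: i

    ! Jump ahead 1000 steps in O(log k) work ...
    x = lcg_jump(42_int64, 1000_int64)
    ! ... and verify against 1000 explicit steps
    y = 42_int64
    do i = 1, 1000
        y = iand(a*y + c, mask)
    end do
    print *, x == y   ! prints T

contains

    function lcg_jump(x0, k) result(x)
        ! Compose f(x) = a*x + c with itself k times by binary
        ! decomposition of k: if g(x) = A*x + C, then g(g(x)) = A**2 * x + (A+1)*C.
        integer(int64), intent(in) :: x0, k
        integer(int64) :: x, an, cn, steps
        x = x0
        an = a
        cn = c
        steps = k
        do while (steps > 0)
            if (iand(steps, 1_int64) /= 0) x = iand(an*x + cn, mask)
            cn = iand((an + 1)*cn, mask)
            an = iand(an*an, mask)
            steps = ishft(steps, -1)
        end do
    end function lcg_jump

end program lcg_jump_demo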

Right way to communicate portable Fortran data types with MPI [duplicate]

I have a Fortran program where I specify the kind of the numeric data types in an attempt to retain a minimum level of precision, regardless of what compiler is used to build the program. For example:
integer, parameter :: rsp = selected_real_kind(4)
...
real(kind=rsp) :: real_var
The problem is that I have used MPI to parallelize the code and I need to make sure the MPI communications are specifying the same type with the same precision. I was using the following approach to stay consistent with the approach in my program:
call MPI_Type_create_f90_real(4,MPI_UNDEFINED,rsp_mpi,mpi_err)
...
call MPI_Send(real_var,1,rsp_mpi,dest,tag,MPI_COMM_WORLD,err)
However, I have found that this MPI routine is not particularly well-supported for different MPI implementations, so it's actually making my program non-portable. If I omit the MPI_Type_create routine, then I'm left to rely on the standard MPI_REAL and MPI_DOUBLE_PRECISION data types, but what if that type is not consistent with what selected_real_kind picks as the real type that will ultimately be passed around by MPI? Am I stuck just using the standard real declaration for a datatype, with no kind attribute and, if I do that, am I guaranteed that MPI_REAL and real are always going to have the same precision, regardless of compiler and machine?
UPDATE:
I created a simple program that demonstrates the issue I see when my internal reals have higher precision than what is afforded by the MPI_DOUBLE_PRECISION type:
program main
   use mpi
   implicit none

   integer, parameter :: rsp = selected_real_kind(16)
   integer :: err
   integer :: rank
   real(rsp) :: real_var

   call MPI_Init(err)
   call MPI_Comm_rank(MPI_COMM_WORLD,rank,err)

   if (rank.eq.0) then
      real_var = 1.123456789012345
      call MPI_Send(real_var,1,MPI_DOUBLE_PRECISION,1,5,MPI_COMM_WORLD,err)
   else
      call MPI_Recv(real_var,1,MPI_DOUBLE_PRECISION,0,5,MPI_COMM_WORLD,&
                    MPI_STATUS_IGNORE,err)
   end if

   print *, rank, real_var

   call MPI_Finalize(err)
end program main
If I build and run with 2 cores, I get:
0 1.12345683574676513672
1 4.71241976735884452383E-3998
Now change the 16 to a 15 in selected_real_kind and I get:
0 1.1234568357467651
1 1.1234568357467651
Is it always going to be safe to use selected_real_kind(15) with MPI_DOUBLE_PRECISION no matter what machine/compiler is used to do the build?
Use the Fortran 2008 intrinsic STORAGE_SIZE to determine the number of bytes that each number requires and send the data as bytes. Note that STORAGE_SIZE returns the size in bits, so you will need to divide by 8 to get the size in bytes.
This solution works for moving data but does not help you use reductions. For that you will have to implement a user-defined reduction operation. If that's important to you, I will update my answer with the details.
For example:
program main
   use mpi
   implicit none

   integer, parameter :: rsp = selected_real_kind(16)
   integer :: err
   integer :: rank
   real(rsp) :: real_var

   call MPI_Init(err)
   call MPI_Comm_rank(MPI_COMM_WORLD,rank,err)

   if (rank.eq.0) then
      real_var = 1.123456789012345
      call MPI_Send(real_var,storage_size(real_var)/8,MPI_BYTE,1,5,MPI_COMM_WORLD,err)
   else
      call MPI_Recv(real_var,storage_size(real_var)/8,MPI_BYTE,0,5,MPI_COMM_WORLD,&
                    MPI_STATUS_IGNORE,err)
   end if

   print *, rank, real_var

   call MPI_Finalize(err)
end program main
I confirmed that this change corrects the problem and the output I see is:
0 1.12345683574676513672
1 1.12345683574676513672
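Regarding the reduction caveat above: since MPI_BYTE has no arithmetic meaning, you would pair a committed contiguous byte type with MPI_Op_create. A hedged sketch, not a tested implementation: the names rsp_byte_type, rsp_sum_op, my_sum, and result_var are illustrative, and my_sum must be declared where real(rsp) is visible (e.g. as a contained procedure).

integer :: rsp_byte_type, rsp_sum_op
real(rsp) :: result_var

! A contiguous MPI type covering the raw bytes of one real(rsp)
call MPI_Type_contiguous(storage_size(real_var)/8, MPI_BYTE, rsp_byte_type, err)
call MPI_Type_commit(rsp_byte_type, err)

! Register a commutative user-defined reduction operation
call MPI_Op_create(my_sum, .true., rsp_sum_op, err)

! Then use it as you would MPI_SUM
call MPI_Reduce(real_var, result_var, 1, rsp_byte_type, rsp_sum_op, 0, MPI_COMM_WORLD, err)

where the operation itself just adds the vectors element-wise:

subroutine my_sum(invec, inoutvec, n, datatype)
   integer :: n, datatype
   real(rsp) :: invec(n), inoutvec(n)
   inoutvec(1:n) = inoutvec(1:n) + invec(1:n)
end subroutine my_sum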
Not really an answer, but we have the same problem and use something like this:
!> Number of digits for single precision numbers
integer, parameter, public :: single_prec = 6
!> Number of digits for double precision numbers
integer, parameter, public :: double_prec = 15
!> Number of digits for extended double precision numbers
integer, parameter, public :: xdble_prec = 18
!> Number of digits for quadruple precision numbers
integer, parameter, public :: quad_prec = 33
integer, parameter, public :: rk_prec = double_prec
!> The kind to select for default reals
integer, parameter, public :: rk = selected_real_kind(rk_prec)
And then have an initialization routine where we do:
!call mpi_type_create_f90_real(rk_prec, MPI_UNDEFINED, rk_mpi, iError)
!call mpi_type_create_f90_integer(long_prec, long_k_mpi, iError)
! Workaround for shoddy MPI implementations.
select case(rk_prec)
case(single_prec)
   rk_mpi = MPI_REAL
case(double_prec)
   rk_mpi = MPI_DOUBLE_PRECISION
case(quad_prec)
   rk_mpi = MPI_REAL16
case default
   write(*,*) 'unknown real type specified for mpi_type creation'
end select
long_k_mpi = MPI_INTEGER8
While this is not nice, it works reasonably well, and it seems to be usable on Cray, IBM Blue Gene, and conventional Linux clusters.
The best thing to do is to push sites and vendors to support this properly in MPI. As far as I know, it has been fixed in OpenMPI and is planned to be fixed in MPICH by 3.1.1. See OpenMPI tickets 3432 and 3435 as well as MPICH tickets 1769 and 1770.
How about:
integer, parameter :: DOUBLE_PREC = kind(0.0d0)
integer, parameter :: SINGLE_PREC = kind(0.0e0)
integer, parameter :: MYREAL = DOUBLE_PREC

if (MYREAL .eq. DOUBLE_PREC) then
   MPIREAL = MPI_DOUBLE_PRECISION
else if (MYREAL .eq. SINGLE_PREC) then
   MPIREAL = MPI_REAL
else
   print *, "Error: Can't figure out MPI precision."
   STOP
end if
and use MPIREAL instead of MPI_DOUBLE_PRECISION from then on.

Poor scaling and a segmentation fault in a Fortran OpenMP code

I'm having some trouble when executing a program with a parallel do. Here is a test code.
module test
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
contains

   subroutine Addition(x,y,s)
      real(dp), intent(in)  :: x,y
      real(dp), intent(out) :: s
      s = x+y
   end subroutine Addition

   function linspace(length,xi,xf) result (vec)
      ! function to create an equally spaced vector given begin and end points
      real(dp), intent(in) :: xi,xf
      integer, intent(in)  :: length
      real(dp), dimension(1:length) :: vec
      integer :: i
      real(dp) :: increment

      increment = (xf-xi)/(real(length)-1)
      vec(1) = xi
      do i = 2,length
         vec(i) = vec(i-1) + increment
      end do
   end function linspace

end module test

program paralleltest
   use, intrinsic :: iso_fortran_env, only: dp => real64
   use test
   use :: omp_lib
   implicit none

   integer, parameter :: length = 1000
   real(dp), dimension(length) :: x,y
   real(dp) :: s
   integer :: i,j
   integer :: num_threads = 8
   real(dp), dimension(length,length) :: SMatrix

   x = linspace(length,.0d0,1.0d0)
   y = linspace(length,2.0d0,3.0d0)

   !$ call omp_set_num_threads(num_threads)
   !$OMP PARALLEL DO
   do i=1,size(x)
      do j = 1,size(y)
         call Addition(x(i),y(j),s)
         SMatrix(i,j) = s
      end do
   end do
   !$OMP END PARALLEL DO

   open(unit=1,file ='Add6.dat')
   do i= 1,size(x)
      do j= 1,size(y)
         write(1,*) x(i),";",y(j),";",SMatrix(i,j)
      end do
   end do
   close(unit=1)
end program paralleltest
I'm compiling the program with gfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium and then running export OMP_NUM_THREADS=8.
This simple code raises at least two big questions about parallel do. The first is that if I run it with length = 1100 or greater, I get a Segmentation fault (core dumped) error message, but with smaller values it runs with no problem. The second is about the time it takes. When I run it with length = 1000 (measured with time ./pt.out) it takes 1.732 s, but if I run it sequentially (without the -fopenmp flag, and with taskset -c 4 time ./pt.out) it takes 1.714 s. I guess the difference between the two ways would show up in longer and more complex code, where parallelism is more useful. In fact, when I tried more complex calculations in parallel with eight threads, the time was cut in half compared to the sequential run, but not to an eighth as I expected. In view of this, my questions are: is a speedup always available, or is it code dependent? And second, is there a friendly way to control which thread runs which iterations? That is, the first thread running the first length/8 iterations, and so on, as if performing several taskset's, each with the block of iterations that I want.
As I commented, the segmentation fault has been treated elsewhere (Why Segmentation fault is happening in this openmp code?); I would use an allocatable array, but you can also raise the stack size using ulimit -s.
Regarding the time, almost all of the runtime is spent in writing the array to the external file.
But even if you remove that and measure only the time spent in the parallel section using omp_get_wtime(), and increase the problem size, it still does not scale too well. That is because there is very little computation for the CPU to do and a lot of array writing to memory (accessing main memory is slow, so cache misses dominate).
As Jean-Claude Arbaut pointed out, your loop order is wrong and makes accessing the memory even slower. Some compilers can change that for you with higher optimization levels (-O2 or -O3), but only some of them.
And even worse, as Jim Cownie pointed out, you have a race condition. Multiple threads try to use the same s for both reading and writing and the program is invalid. You need to make s private using private(s).
With the above fixes I get a roughly two times faster parallel section with four cores and four threads. Don't try to use hyper-threading, it slows the program down.
If you give the CPU more computational work to do, like s = Bessel_J0(x)/Bessel_J1(y), it scales pretty well for me: almost four times faster with four threads, and hyper-threading does speed it up a little bit.
Finally, I suggest just removing the manual setting of the number of threads, it is a pain for testing. If you remove that, you can use OMP_NUM_THREADS=4 ./a.out easily.
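Putting those fixes together, a sketch of the corrected parallel section (s made private, the loops swapped so the first index varies fastest over column-major storage, wall time taken with omp_get_wtime(); t0 and t1 are double precision variables declared with the others):

t0 = omp_get_wtime()
!$OMP PARALLEL DO PRIVATE(i, j, s)
do j = 1, size(y)
   do i = 1, size(x)
      call Addition(x(i), y(j), s)
      SMatrix(i,j) = s
   end do
end do
!$OMP END PARALLEL DO
t1 = omp_get_wtime()
print *, 'parallel section: ', t1 - t0, ' seconds'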

No speedup with OpenMP in DO loop [duplicate]

I have a Fortran 90 program calling a multi-threaded routine. I would like to time this program from the calling routine. If I use cpu_time(), I end up getting the CPU time of all the threads (8 in my case) added together, not the actual time it takes for the program to run. The etime() routine seems to do the same. Any idea how I can time this program (without using a stopwatch)?
Try omp_get_wtime(); see http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html for the signature.
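For instance, a minimal wall-clock measurement around the threaded call could look like this (big_multithreaded_calc is a placeholder for your routine):

use omp_lib
double precision :: t0, t1

t0 = omp_get_wtime()
call big_multithreaded_calc()   ! the multi-threaded routine being timed
t1 = omp_get_wtime()
print *, 'elapsed wall time (s): ', t1 - t0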
If this is a one-off thing, then I agree with larsmans that using gprof or some other profiler is probably the way to go; but I also agree that it is very handy to have coarser timers in your code for timing different phases of the computation. The best timing information you have is the stuff you actually use, and it's hard to beat numbers that are output every single time you run your code.
Jeremiah Willcock's pointing out omp_get_wtime() is very useful; it's standards-compliant, so it should work with any OpenMP compiler. But it only has second resolution, which may or may not be enough, depending on what you're doing. (Edited: the above was completely wrong.)
Fortran 90 defines system_clock(), which can also be used with any standards-compliant compiler; the standard doesn't specify a time resolution, but with gfortran it seems to be milliseconds and with ifort microseconds. I usually use it in something like this:
subroutine tick(t)
   integer, intent(OUT) :: t
   call system_clock(t)
end subroutine tick

! returns time in seconds from now to time described by t
real function tock(t)
   integer, intent(in) :: t
   integer :: now, clock_rate
   call system_clock(now, clock_rate)
   tock = real(now - t)/real(clock_rate)
end function tock
And using them:
call tick(calc)
! do big calculation
calctime = tock(calc)
print *,'Timing summary'
print *,'Calc: ', calctime