No speedup with OpenMP in DO loop [duplicate] - fortran

I have a Fortran 90 program calling a multi threaded routine. I would like to time this program from the calling routine. If I use cpu_time(), I end up getting the cpu_time for all the threads (8 in my case) added together and not the actual time it takes for the program to run. The etime() routine seems to do the same. Any idea on how I can time this program (without using a stopwatch)?

Try omp_get_wtime(); see http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html for the signature.
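For example, a minimal sketch of timing a parallel region with omp_get_wtime() (the body of the parallel region is just a placeholder):

program wtime_example
    use omp_lib                       ! provides omp_get_wtime()
    implicit none
    double precision :: t_start, t_end

    t_start = omp_get_wtime()         ! wall-clock time, not summed CPU time
    !$omp parallel
    ! ... the multi-threaded work you want to time goes here ...
    !$omp end parallel
    t_end = omp_get_wtime()

    print *, 'Elapsed wall-clock time (s): ', t_end - t_start
end program wtime_example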

If this is a one-off thing, then I agree with larsmans that using gprof or some other profiler is probably the way to go; but I also agree that it is very handy to have coarser timers in your code for timing different phases of the computation. The best timing information you have is the stuff you actually use, and it's hard to beat stuff that's output every single time you run your code.
Jeremia Wilcock's suggestion of omp_get_wtime() is very useful; it's standards-compliant, so it should work with any OpenMP compiler. (I originally claimed here that it only has one-second resolution; that was completely wrong.)
Fortran 90 defines system_clock(), which can also be used with any standards-compliant compiler; the standard doesn't specify a time resolution, but gfortran seems to give milliseconds and ifort microseconds. I usually use it with something like this:
subroutine tick(t)
    integer, intent(OUT) :: t
    call system_clock(t)
end subroutine tick

! returns time in seconds from now to time described by t
real function tock(t)
    integer, intent(in) :: t
    integer :: now, clock_rate
    call system_clock(now, clock_rate)
    tock = real(now - t)/real(clock_rate)
end function tock
And using them:
call tick(calc)
! do big calculation
calctime = tock(calc)
print *,'Timing summary'
print *,'Calc: ', calctime

Related

Poor scaling and a segmentation fault in a Fortran OpenMP code

I'm having some trouble when executing a program with a parallel do. Here is a test code.
module test
    use, intrinsic :: iso_fortran_env, only: dp => real64
    implicit none
contains
    subroutine Addition(x,y,s)
        real(dp), intent(in) :: x,y
        real(dp), intent(out) :: s
        s = x+y
    end subroutine Addition

    function linspace(length,xi,xf) result (vec)
        ! function to create an equally spaced vector given a begin and end point
        real(dp), intent(in) :: xi,xf
        integer, intent(in) :: length
        real(dp), dimension(1:length) :: vec
        integer :: i
        real(dp) :: increment
        increment = (xf-xi)/(real(length)-1)
        vec(1) = xi
        do i = 2,length
            vec(i) = vec(i-1) + increment
        end do
    end function linspace
end module test

program paralleltest
    use, intrinsic :: iso_fortran_env, only: dp => real64
    use test
    use :: omp_lib
    implicit none
    integer, parameter :: length = 1000
    real(dp), dimension(length) :: x,y
    real(dp) :: s
    integer :: i,j
    integer :: num_threads = 8
    real(dp), dimension(length,length) :: SMatrix
    x = linspace(length,.0d0,1.0d0)
    y = linspace(length,2.0d0,3.0d0)
    !$ call omp_set_num_threads(num_threads)
    !$OMP PARALLEL DO
    do i = 1,size(x)
        do j = 1,size(y)
            call Addition(x(i),y(j),s)
            SMatrix(i,j) = s
        end do
    end do
    !$OMP END PARALLEL DO
    open(unit=1,file='Add6.dat')
    do i = 1,size(x)
        do j = 1,size(y)
            write(1,*) x(i),";",y(j),";",SMatrix(i,j)
        end do
    end do
    close(unit=1)
end program paralleltest
I'm compiling and running the program in the following way: gfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium and then export OMP_NUM_THREADS=8.
This simple code raises at least two big questions about parallel do for me. The first is that if I run with length = 1100 or greater, I get a Segmentation fault (core dumped) error message, but with smaller values it runs with no problem. The second is about the time it takes. When I run it with length = 1000 (with time ./pt.out) it takes 1.732 s, but if I run it sequentially (compiled without -fopenmp and run with taskset -c 4 time ./pt.out) it takes 1.714 s. I guess the difference between the two would show up in a longer and more complex code where parallelism is more useful. In fact, when I tried more complex calculations running in parallel with eight threads, the time was cut to half of the sequential time, but not to an eighth as I expected. In view of this my questions are: is a speedup always available, or is it code dependent? And second, is there a friendly way to control which thread runs which iterations? That is, the first thread running the first length/8 iterations, and so on, like performing several taskset's with different code where each contains the iteration range that I want.
As I commented, the Segmentation fault has been treated elsewhere (Why Segmentation fault is happening in this openmp code?). I would use an allocatable array, but you can also set the stack size using ulimit -s.
Regarding the time, almost all of the runtime is spent in writing the array to the external file.
But even if you remove that, measure only the time spent in the parallel section using omp_get_wtime(), and increase the problem size, it still does not scale too well. That is because there is very little computation for the CPU to do and a lot of array writing to memory (accessing main memory is slow - cache misses).
As Jean-Claude Arbaut pointed out, your loop order is wrong and makes accessing the memory even slower. Some compilers can change that for you with higher optimization levels (-O2 or -O3), but only some of them.
And even worse, as Jim Cownie pointed out, you have a race condition. Multiple threads try to use the same s for both reading and writing and the program is invalid. You need to make s private using private(s).
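A minimal sketch of the corrected parallel section, assuming the rest of the program stays as posted (the inner loop index i is already private by default in Fortran OpenMP):

!$OMP PARALLEL DO PRIVATE(s)
do j = 1,size(y)              ! outer loop over the second index
    do i = 1,size(x)          ! inner loop over the first index: column-major friendly
        call Addition(x(i),y(j),s)
        SMatrix(i,j) = s
    end do
end do
!$OMP END PARALLEL DO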
With the above fixes I get a roughly two times faster parallel section with four cores and four threads. Don't try to use hyper-threading; it slows the program down.
If you give the CPU more computational work to do, like s = Bessel_J0(x)/Bessel_J1(y), it scales pretty well for me, almost four times faster with four threads, and hyper-threading does speed it up a little bit.
Finally, I suggest just removing the manual setting of the number of threads; it is a pain for testing. If you remove that, you can easily use OMP_NUM_THREADS=4 ./a.out.

Fortran execution time

I am new to Fortran and I would like to ask for help. My code is very simple. It just enters a loop, and in each iteration uses the system intrinsic procedure to enter the directory named code and run the evalcode.x program.
program subr1
    implicit none
    integer :: i
    real :: T1,T2
    call cpu_time(T1)
    do i = 1,6320
        call system("cd ~/code; ../evalcede/source/evalcode.x test ")
    enddo
    call cpu_time(T2)
    print *, T1,T2
end program subr1
The measured time says the program ran for 0.5 s, but the time this code actually needs for execution is 1.5 hours! The program is suspended or waiting and I do not know why.
Note: this is more of an elaborated comment on Janneb's post, to provide a bit more information.
As indicated by Janneb, the function CPU_TIME does not necessarily return wall-clock time, which is what you are after. This is especially true when timing system calls.
Furthermore, the output of CPU_TIME is really a processor and compiler dependent value. To demonstrate this, the following code is compiled with gfortran, ifort and solaris-studio f90:
program test_cpu_time
    real :: T1,T2
    call cpu_time(T1)
    call execute_command_line("sleep 5")
    call cpu_time(T2)
    print *, T1,T2, T2-T1
end program test_cpu_time
#gfortran>] 1.68200000E-03 1.79799995E-03 1.15999952E-04
#ifort >] 1.1980000E-03 1.3410000E-03 1.4299992E-04
#f90 >] 0.0E+0 5.00534 5.00534
Here, you see that both gfortran and ifort exclude the time of the system-command while solaris-studio includes the time.
In general, one should see the difference between the output of two consecutive calls to CPU_TIME as the time spent by the CPU to perform the actions. Due to the system call, the process is actually in a sleep state during the time of execution and thus no CPU time is spent. This can be seen with a simple ps:
$ ps -O ppid,nlwp,psr,stat $(pgrep sleep) $(pgrep a.out)
PID PPID NLWP PSR STAT S TTY TIME COMMAND
27677 17146 1 2 SN+ S pts/40 00:00:00 ./a.out
27678 27677 1 1 SN+ S pts/40 00:00:00 sleep 5
NLWP indicates how many threads in use
PPID indicates parent PID
STAT indicates 'S' for interruptible sleep (waiting for an event to complete)
PSR is the cpu/thread it is running on.
You notice that the main program a.out is in a sleep state and both the system call and the main program are running on separate cores. Since the main program is in a sleep state, the CPU_TIME will not clock this time.
note: solaris-studio is the odd duck, but then again, it's solaris studio!
General comment: CPU_TIME is still useful for determining the execution time of segments of code. It is not useful for timing external programs. Other, more dedicated tools exist for this, such as time; the OP's program could be reduced to the bash command:
$ time ( for i in $(seq 1 6320); do blabla; done )
This is what the standard has to say on CPU_TIME(TIME)
CPU_TIME(TIME)
Description: Return the processor time.
Note 13.9: A processor for which a single result is inadequate (for example, a parallel processor) might choose to
provide an additional version for which time is an array.
The exact definition of time is left imprecise because of the variability in what different processors are able
to provide. The primary purpose is to compare different algorithms on the same processor or discover which
parts of a calculation are the most expensive.
The start time is left imprecise because the purpose is to time sections of code, as in the example.
Most computer systems have multiple concepts of time. One common concept is that of time expended by
the processor for a given program. This might or might not include system overhead, and has no obvious
connection to elapsed “wall clock” time.
source: Fortran 2008 Standard, Section 13.7.42
On top of that:
It is processor dependent whether the results returned from CPU_TIME, DATE_AND_TIME and SYSTEM_CLOCK are dependent on which image calls them.
Note 13.8: For example, it is unspecified whether CPU_TIME returns a per-image or per-program value, whether all
images run in the same time zone, and whether the initial count, count rate, and maximum in SYSTEM_CLOCK are the same for all images.
source: Fortran 2008 Standard, Section 13.5
The CPU_TIME intrinsic measures CPU time consumed by the program itself, not including that of its subprocesses (1).
Apparently most of the time is spent in evalcode.x which explains why the reported wallclock time is much higher.
If you want to measure wallclock time intervals in Fortran, you can use the SYSTEM_CLOCK intrinsic.
(1) Well, that's what GFortran does, at least. The standard doesn't specify exactly what it means.
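As a minimal sketch (the command string is just the one from the question), wall-clock timing of the loop could look like this:

program time_system_calls
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer(int64) :: count_start, count_end, count_rate
    integer :: i

    call system_clock(count_start, count_rate)    ! wall-clock counter and its rate
    do i = 1, 6320
        call execute_command_line("cd ~/code; ../evalcede/source/evalcode.x test")
    end do
    call system_clock(count_end)

    print *, 'Wall-clock time (s): ', real(count_end - count_start)/real(count_rate)
end program time_system_calls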

The value given by cpu_time does not change over little time

I want to check how much time does it take the computer to compute a function. To do this I wanted to compare the values given by the cpu_time subroutine before and after calling my function. To my surprise, the value before and after was the same as if it took zero time to perform the function. To check it, I created a simple piece of code
call cpu_time(time)
write(*,*) time
do i = 1,10000
    ! simple math equation here
end do
call cpu_time(time)
write(*,*) time
And after running the program, the value printed before and after the loop was exactly the same. My guess is that the system clock is not precise enough to distinguish such small changes, but does that really make sense? All in all, I don't know how to measure the time needed to execute my function without this working properly.

Program is not any faster with OpenMP

My goal is to parallelize a section in my Fortran program. The flow of the program is:
Read data from a file
make some computations
write the results to 2 different files
Here I want to parallelize the writing process since I’m writing into different files.
module foo
    use omp_lib
    implicit none

    type element
        integer, dimension(:), allocatable :: v1, v2
        real(kind=8), dimension(:,:), allocatable :: M
    end type element

contains

    subroutine test()
        implicit none
        type(element) :: e
        do
            e = read_data_from_file()
            call compute_data(e)
            !$OMP SECTIONS
            !$OMP SECTION
            !$ call write_to_file1(e)
            !$OMP SECTION
            !$ call write_to_file2(e)
            !$OMP END SECTIONS
        end do
    end subroutine test

    ...

end module foo
But this program isn't going any faster, so I think I'm missing something?
In general one can divide scientific computing codes into bandwidth-bound and compute-bound algorithms. Bandwidth-bound algorithms are those that perform only a few operations on the data they need, e.g. O(n) flops on O(n) data. Thinking of hard disk speed or network connection speed, I/O is a bandwidth-bound operation as well and is therefore not parallelizable, or only badly so.
If you really want to gain performance from the parallelization, split the code into bandwidth-bound and compute-bound parts and spend your time parallelizing the latter.
If you specify your problem more precisely, there are hundreds of experts eager to solve it. From the comment on the answer above I see that you are using binary output but still have bandwidth left to write faster; that means your disk speed is fine and you're not limited by parsing, but rather that your actual program is not producing data at a faster pace than this.
So optimize your code to make it catch up with your write speed, instead of trying to increase the write speed while the code stays equally slow.
Writing the 2 files sequentially at the maximum of your bandwidth is as fast as, and much easier than, writing them in parallel (at the same maximum speed).
If I am mistaken, and you are indeed limited by IO, maybe this other question/answer can help you: How to avoid programs in status D.

High Performance Fortran (HPF) without directives?

In High Performance Fortran (HPF), I could specify the distribution of arrays involved in a parallel calculation using the DISTRIBUTE directive. For example, the following minimal subroutine will sum two arrays in parallel:
subroutine mysum(x,y,z)
    integer, intent(in) :: y(10000), z(10000)
    integer, intent(out) :: x(10000)
!HPF$ DISTRIBUTE x(BLOCK), y(BLOCK), z(BLOCK)
    x = y + z
end subroutine mysum
My question is, is the DISTRIBUTE directive necessary? I know in practice this is of little interest, but I'm curious as to whether an unadorned, directive-free Fortran program could also be a valid HPF program?
I do not believe the DISTRIBUTE directive is necessary, and I never used it.
You can achieve this implicitly by using FORALL statements instead of DO loops where applicable. Originally, DO loops would give explicit order of operation on array elements, whereas FORALL would allow the processor to determine an optimal order at runtime. I do not think this makes much difference nowadays, because modern compilers are able to optimize/vectorize/parallelize DO loops where possible. I cannot tell for sure for other compilers, but I remember using Intel Fortran Compiler to compile and run a program on 2 and 4 processors in parallel without using DISTRIBUTE.
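As an illustrative sketch (not from the original answer; mysum_forall is just a made-up name), the earlier mysum example written with FORALL and no directives would be:

! FORALL leaves the order of the element assignments up to the processor,
! so an HPF compiler is free to distribute the work without a directive.
subroutine mysum_forall(x,y,z)
    integer, intent(in) :: y(10000), z(10000)
    integer, intent(out) :: x(10000)
    integer :: i
    forall (i = 1:10000)
        x(i) = y(i) + z(i)
    end forall
end subroutine mysum_forall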
However, depending on the processor architecture and compiler, it is best to try out what you have and see what gives you optimal results or efficiency.