Program is not any faster with OpenMP - fortran

My goal is to parallelize a section in my Fortran program. The flow of the program is:
Read data from a file
make some computations
write the results to 2 different files
Here I want to parallelize the writing process since I’m writing into different files.
module foo
use omp_lib
implicit none
type element
integer, dimension(:), allocatable :: v1, v2
real(kind=8), dimension(:,:), allocatbale :: M
end type element
contains
subroutine test()
implicit none
type(element) :: e
do
e = read_data_from_file()
call compute_data(e)
!$OMP SECTIONS
!$OMP SECTION
!$ call write_to_file1(e)
!$OMP SECTION
!$ call write_to_file2(e)
!$OMP END SECTIONS
end do
end subroutine test
...
end module foo
But this program isn't going anything faster. So I think that I’m missing something?

In general one can divide scientific computing codes in bandwidth bound and computational bound algorithms. The bandwidth bound algorithms are all that only do few operations on the data they need. Like having O(n) data where O(n) flops are performed on. Thinking of the hard disk speed or the network connection speed, I/O is a bandwidth bound operation as well and therefore not or only badly parallelizable.
If you really want to gain performance out of the parallelization split the code into bandwidth bound and computational bound algorithms and use your time to parallelize the later ones.

If you specify you problem more precisely there are hundreds of experts eager to solve it. From the comment to the answer above I see that you are using binary output but still has bandwidth left to write faster, that means that you disk speed is fine and you're not limited by parsing, but rather that you actual program is not putting out data in a faster pace than this.
So optimize your code, to make it catch up with your write-speed, instead of increasing the write speed with an equally slow code.
Writing them 2 files sequentially at the max of your bandwidth is as fast and much easier than writing in parallel (at the same max speed).
If I am mistaken, and you are indeed limited by IO, maybe this other question/answer can help you: How to avoid programs in status D.

Related

Fortran Openmp: Ghost threads interrupt application upon allocate

I am experiencing an issue which starts to drive me crazy. As part of a large software tool I am writing a Fortran module to perform astrodynamic computations.
The main functionality is contained in a subroutine which uses OpenMP and other subroutines to perform its tasks.
The workflow is as follows:
1) Preparations - read in files
2) Start the first parallel omp parallel do
3) Based on the working mode, the function might call itself recursively and process a separate part of the data (thereby executing a second omp do loop)
4) Combine the results into one large array of a derived type
Up to this point everything has been running as expected. Next is the part that fails:
5) Using the results from 4, start another parallel loop to continue the processing.
Step 5 sometimes works and sometimes it stops with an access violation. This is independently of the optimization level selected (nominally we use O2, but it happens also with O0).
Hence I fired up Intel inspector and looked for data races/deadlocs,....
It reported 5 very strange data races, which make no sense to me, as the code locations are either at the definition/end of a subroutine, a codeline reading threadprivate global variable, an Intel MKL routine or at a location reading a local allocatable array which is in a subroutine that is called within the parallel region.
After reading upon things a bit, I enabled the recursive switch to force the local array to end on the stack. Another Inspector analysis was interrupted by the same access violation. When loading the results no errors were detected, but the tool warns that due to the abnormal end data may have been lost.
I then tried to put the entire loop into an OMP critical - just as a test. Now the access violation disappeared, but the code is stuck very early.After processing 15/500 objects it just stops. This looked like a deadlock.
So I continued the search and commented out every OMP statement in the do loop of step 5.
Interestingly the code stops as well at iteration 15!
Hence I used the debugger to locate the call where the infinite waiting state occurs.
It is at the final allocate statement:
Subroutine doWork(t0, Pert, allAtmosData, considerArray, perigeeFirstTimeStepUncertainties)
real(dp), intent(in) :: t0
type (tPert), intent(in) :: Pert
type (tThermosphereData), dimension(:), allocatable, intent(in) :: allAtmosData
integer(i4), dimension(3), intent(in) :: considerArray
real(dp), dimension(:,:), allocatable, intent(out) :: perigeeFirstTimeStepUncertainties
!Locals:
real(dp), dimension(:), allocatable :: solarFluxPerigeeUnc, magneticIndexUnc, modelUnc
integer(i4) :: dataPoints
!Allocate memory for the computations
dataPoints = size(allAtmosData, 1)
allocate(solarFluxPerigeeUnc(0:dataPoints-1), source=0.0_dp)
allocate(magneticIndexUnc(0:dataPoints-1), source=0.0_dp)
allocate(modelUnc(0:dataPoints-1), source=0.0_dp)
... Subroutine body ...
!Assign outputs
allocate(perigeeFirstTimeStepUncertainties(3, 0:dataPoints-1), source=0.0_dp)
perigeeFirstTimeStepUncertainties(MGN_SOLAR_FLUX_UNCERTAINTY,:) = solarFluxPerigeeUnc
perigeeFirstTimeStepUncertainties(MGN_MAG_INDEX_UNCERTAINTY,:) = magneticIndexUnc
perigeeFirstTimeStepUncertainties(MGN_MODEL_UNCERTAINTY,:) = modelUnc
End Subroutine
The subroutine works perfectly in other places of the overall program and also for the first
14 iterations of step 5. In the 15th iteration however it is stuck at the last allocate, or as I just have managed to produce - also results in an access violation.
When I pause the code using the debugger, I see that somehow the openmp library has been loaded and the code hangs/crashes in frontend.cpp, which I have no access to:
What is happening here? How can a simple allocate suddenly cause and OMP activity in a loop where any OMP statement has been commented out?
As a sidenote: If I comment out the allocate and pass the size as an extra argument, the same error happens at the next allocate in the code.
Any help or workaround is highly appreciated! I already googled on how to shut down any prior thread-pool, so that there is definately no omp ghost thread activity left, but apparently there is no way to do so. Also ultimately the loop shall work in parallel.
EDIT: I just did a second test on the machine with 2018 v5. It also shows that the allocate statement results in a call to for_alloc_allocatable() and then the OMP statements. Also the mov statement where the program crashes is shown:
I guess that something is probably causing memory corruption in the second recursive OMP loop, as somehow the allocate is redirected to a threadpool memory allocation, however all threads but the main one should be passive? I cannot read assembler, but I was surprised to find that TBB is used by OpenMp? Could this be a bug in TBB?

Poor scaling and a segmentation fault in a Fortran OpenMP code

I'm having some trouble when executing a program with a parallel do. Here is a test code.
module test
use, intrinsic :: iso_fortran_env, only: dp => real64
implicit none
contains
subroutine Addition(x,y,s)
real(dp),intent(in) :: x,y
real(dp), intent(out) :: s
s = x+y
end subroutine Addition
function linspace(length,xi,xf) result (vec)
! function to create an equally spaced vector given a begin and end point
real(dp),intent(in) :: xi,xf
integer, intent(in) :: length
real(dp),dimension(1:length) :: vec
integer ::i
real(dp) :: increment
increment = (xf-xi)/(real(length)-1)
vec(1) = xi
do i = 2,length
vec(i) = vec(i-1) + increment
end do
end function linspace
end module test
program paralleltest
use, intrinsic :: iso_fortran_env, only: dp => real64
use test
use :: omp_lib
implicit none
integer, parameter :: length = 1000
real(dp),dimension(length) :: x,y
real(dp) :: s
integer:: i,j
integer :: num_threads = 8
real(dp),dimension(length,length) :: SMatrix
x = linspace(length,.0d0,1.0d0)
y = linspace(length,2.0d0,3.0d0)
!$ call omp_set_num_threads(num_threads)
!$OMP PARALLEL DO
do i=1,size(x)
do j = 1,size(y)
call Addition(x(i),y(j),s)
SMatrix(i,j) = s
end do
end do
!$OMP END PARALLEL DO
open(unit=1,file ='Add6.dat')
do i= 1,size(x)
do j= 1,size(y)
write(1,*) x(i),";",y(j),";",SMatrix(i,j)
end do
end do
close(unit=1)
end program paralleltest
I'm running the program in the following waygfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium and then export OMP_NUM_THREADS=8
This simple code brings me at least two big questions on parallel do. The first is that if I run with length = 1100 or greater, I have Segmentation fault (core dump) error message but with smaller values it runs with no problem. The second is about the time it takes. When I run it with length = 1000 (run with time ./pt.out) the time it takes is 1,732s but if I run it in a sequential way (without calling the -fopenmplibrary and with taskset -c 4 time./pt.out ) it takes 1,714s. I guess the difference between both ways arise in a longer and more complex code where parallel is more usefull. In fact when I tried it with more complex calculations running in parallel with eight threads, time was reduced at half that it took in sequential but not an eighth as I expected. In view of this my questions are, is any optimization available always or is it code dependent? and second, is there a friendly way to control which thread runs which iteration? That is the first running the first length/8 iteration, and so on, like performing several taskset 's with different code where in each is the iteration that I want.
As I commented, the Segmentation fault has been treated elsewhere Why Segmentation fault is happening in this openmp code?, I would use an allocatable array, but you can also set the stacksize using ulimit -s.
Regarding the time, almost all of the runtime is spent in writing the array to the external file.
But even if you remove that and you measure the time only spent in the parallel section using omp_get_wtime() and increase the problem size, it still does not scale too well. This because there is very little computation for the CPU to do and a lot of array writing to memory (accessing main memory is slow - cache misses).
As Jean-Claude Arbaut pointed out, your loop order is wrong and makes accessing the memory even slower. Some compilers can change that for you with higher optimization levels (-O2 or -O3), but only some of them.
And even worse, as Jim Cownie pointed out, you have a race condition. Multiple threads try to use the same s for both reading and writing and the program is invalid. You need to make s private using private(s).
With the above fixes I get a roughly two times faster parallel section with four cores and four threads. Don't try to use hyper-threading, it slows the program down.
If you give the CPU more computational work to do, like s = Bessel_J0(x)/Bessel_J1(y) it scales pretty well for me, almost four times faster with four threads, and hyper threading does speed it up a little bit.
Finally, I suggest just removing the manual setting of the number of threads, it is a pain for testing. If you remove that, you can use OMP_NUM_THREADS=4 ./a.out easily.

FFTW in Fortran result contains only zeros

I have been trying to write a simple program to perform an fft on a 1D input array using fftw3. Here I am using a seismogram as an input. The output array is, however, coming out to contain only zeroes.
I know that the input is correct as I have tried doing the fft of the same input file in MATLAB as well, which gives correct results. There is no compilation error. I am using f95 to compile this, however, gfortran was also giving pretty much the same results. Here is the code that I wrote:-
program fft
use functions
implicit none
include 'fftw3.f90'
integer nl,row,col
double precision, allocatable :: data(:,:),time(:),amplitude(:)
double complex, allocatable :: out(:)
integer*8 plan
open(1,file='test-seismogram.xy')
nl=nlines(1,'test-seismogram.xy')
allocate(data(nl,2))
allocate(time(nl))
allocate(amplitude(nl))
allocate(out(nl/2+1))
do row = 1,nl
read(1,*,end=101) data(row,1),data(row,2)
amplitude(row)=data(row,2)
end do
101 close(1)
call dfftw_plan_dft_r2c_1d(plan,nl,amplitude,out,FFTW_R2HC,FFTW_PATIENT)
call dfftw_execute_dft_r2c(plan, amplitude, out)
call dfftw_destroy_plan(plan)
do row=1,(nl/2+1)
print *,out(row)
end do
deallocate(data)
deallocate(amplitude)
deallocate(time)
deallocate(out)
end program fft
The nlines() function is a function which is used to calculate the number of lines in a file, and it works correctly. It is defined in the module called functions.
This program pretty much tries to follow the example at http://www.fftw.org/fftw3_doc/Fortran-Examples.html
There might just be a very simple logical error that I am making, but I am seriously unable to figure out what is going wrong here. Any pointers would be very helpful.
This is pretty much how the whole output looks like:-
.
.
.
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
.
.
.
My doubt is directly regarding fftw, since there is a tag for fftw on SO, so I hope this question is not off topic
As explained in the comments first by #roygvib and #Ross, the plan subroutines overwrite the input arrays because they try the transform many times with different parameters. I will add some practical use considerations.
You claim you do care about performance. Then there are two possibilities:
You do the transform only once as you show in your code. Then there is no point to use FFTW_MEASURE. The planning subroutine is many times slower than actual plan execute subroutine. Use FFTW_ESTIMATE and it will be much faster.
FFTW_MEASURE tells FFTW to find an optimized plan by actually
computing several FFTs and measuring their execution time. Depending
on your machine, this can take some time (often a few seconds).
FFTW_MEASURE is the default planning option.
FFTW_ESTIMATE specifies that, instead of actual measurements of
different algorithms, a simple heuristic is used to pick a (probably
sub-optimal) plan quickly. With this flag, the input/output arrays are
not overwritten during planning.
http://www.fftw.org/fftw3_doc/Planner-Flags.html
You do the same transform many times for different data. Then you must do the planning only once before the first transform and than re-use the plan. Just make the plan first and only then you fill the array with the first input data. Making the plan before every transport would make the program extremely slow.

No speedup with OpenMP in DO loop [duplicate]

I have a Fortran 90 program calling a multi threaded routine. I would like to time this program from the calling routine. If I use cpu_time(), I end up getting the cpu_time for all the threads (8 in my case) added together and not the actual time it takes for the program to run. The etime() routine seems to do the same. Any idea on how I can time this program (without using a stopwatch)?
Try omp_get_wtime(); see http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html for the signature.
If this is a one-off thing, then I agree with larsmans, that using gprof or some other profiling is probably the way to go; but I also agree that it is very handy to have coarser timers in your code for timing different phases of the computation. The best timing information you have is the stuff you actually use, and it's hard to beat stuff that's output every single tiem you run your code.
Jeremia Wilcock pointing out omp_get_wtime() is very useful; it's standards compliant so should work on any OpenMP compiler - but it only has second resolution, which may or may not be enough, depending on what you're doing. Edited; the above was completely wrong.
Fortran90 defines system_clock() which can also be used on any standards-compliant compiler; the standard doesn't specify a time resolution, but gfortran it seems to be milliseconds and ifort seems to be microseconds. I usually use it in something like this:
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
And using them:
call tick(calc)
! do big calculation
calctime = tock(calc)
print *,'Timing summary'
print *,'Calc: ', calctime

High Performance Fortran (HPF) without directives?

In High Performance Fortran (HPF), I could specify the distribution of arrays involved in a parallel calculation using the DISTRIBUTE directive. For example, the following minimal subroutine will sum two arrays in parallel:
subroutine mysum(x,y,z)
integer, intent(in) :: y(10000), z(10000)
integer, intent(out) :: x(10000),
!HPF$ DISTRIBUTE x(BLOCK), y(BLOCK), z(BLOCK)
x = y + z
end subroutine mysum
My question is, is the DISTRIBUTE directive necessary? I know in practise this is of little interest, but I'm curious as to whether an unadorned, directive-free, Fortran program could also be a valid HPF program?
I do not believe DISTRIBUTE statement is necessary, and I never used it.
You can achieve this implicitly by using FORALL statements instead of DO loops where applicable. Originally, DO loops would give explicit order of operation on array elements, whereas FORALL would allow the processor to determine an optimal order at runtime. I do not think this makes much difference nowadays, because modern compilers are able to optimize/vectorize/parallelize DO loops where possible. I cannot tell for sure for other compilers, but I remember using Intel Fortran Compiler to compile and run a program on 2 and 4 processors in parallel without using DISTRIBUTE.
However, depending on the processor architecture and compiler, it is best to try out what you have and see what gives you optimal results or efficiency.