I know I once made a similar topic, but that one was different. This time, adding a print statement does not change whether or not I get a segfault.
call omp_set_num_threads(omp_get_max_threads())
!$omp parallel do &
!$omp default(firstprivate) &
!$omp private(present_surface) &
!$omp lastprivate(fermi)
do m = 1, fourth
   do n = 1, third
      do j = 1, second
         do i = 1, first
            !current angle is phi[i,j,ii,jj]
            !we have to find the current fermi surface
            present_surface = 0.
            do a = 1, fourth
               if (angle(i,j,n,m) == angle_array(a)) then
                  present_surface = surface(a)
               end if
            end do
            if (radius(i,j,n,m) >= present_surface) then
               fermi(i,j,n,m) = 0.
            else
               fermi(i,j,n,m) = 1.
            end if
         end do
      end do
   end do
end do
!$omp end parallel do
I'm not sure about the lastprivate(fermi) statement, but it doesn't matter at this moment. shared gives the same behaviour.
So, I run this script with 'first, second, third, fourth' being increased. Typical output:
[10] Time elapsed is 0.001 [s].
[15] Time elapsed is 0.002 [s].
[20] Time elapsed is 0.005 [s].
./compile: line 2: 4310 Segmentation fault python fortran_test.py
Well, how to proceed? I ran it under gdb (gdb python, then run fortran_test.py) and found:
(gdb) run fortran_test.py
Starting program: /usr/bin/python fortran_test.py
[Thread debugging using libthread_db enabled]
[New Thread 0x7fffed9b0700 (LWP 4251)]
[New Thread 0x7fffe8ba5700 (LWP 4252)]
[New Thread 0x7fffe83a4700 (LWP 4253)]
[New Thread 0x7fffe7ba3700 (LWP 4254)]
[10] Time elapsed is 0.008 [s].
[15] Time elapsed is 0.004 [s].
[20] Time elapsed is 0.005 [s].
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe7ba3700 (LWP 4254)]
0x00007fffe8dc0bb7 in __populate_MOD_fermi_integrand._omp_fn.0 () at populate.f90:31
31 do n = 1, third
If I change things in the inner loop - for instance, removing the j,i loops and setting those indices to a constant - then I just get the segfault at a different line.
I think this has something to do with memory, because it triggers as N increases. However, the things I've tried (export GOMP_STACKSIZE, OMP_STACKSIZE, ulimit) haven't fixed it, and I see no difference using them. (For the time being, I removed them).
Finally, the compile command (it uses f2py):
f2py --fcompiler=gfortran --f90flags="-fopenmp -g" -lgomp -m -c populate populate.f90
As you can see, I'm quite stuck. I hope some of you know how to solve the problem.
My aim is to have N=100 run quickly (hence the OpenMP Fortran function), but this shouldn't matter for the Fortran code, should it?
If you're wondering: I have 4 GB RAM, and my swap is 3.1 G (Linux Swap / Solaris).
Thank you!
default(firstprivate) lastprivate(fermi) means that each thread receives a private copy of angle, radius, and fermi, with each being of size first x second x third x fourth. In my experience, private arrays are always allocated on the stack, even when the option to automatically turn big arrays into heap arrays is given. Your code most likely crashes because of insufficient stack space.
Instead of making everything private, you should really look at the data dependencies and pick the correct data-sharing classes. angle, angle_array, surface, and radius are never written to, so they should all be shared. present_surface is modified and should be private. fermi is written to, but never at the same location by more than one thread, so it should also be shared. Also note that lastprivate won't have the expected result: it makes available the value that the corresponding variable has in the thread executing the logically last iteration of the m-loop. first, second, third, and fourth are simply treated as constants and should be shared. The loop variables are, of course, private, but the compiler handles that automatically.
!$omp parallel do private(present_surface)
do m = 1, fourth
   do n = 1, third
      do j = 1, second
         do i = 1, first
            !current angle is phi[i,j,ii,jj]
            !we have to find the current fermi surface
            present_surface = 0.
            do a = 1, fourth
               if (angle(i,j,n,m) == angle_array(a)) then
                  present_surface = surface(a)
               end if
            end do
            if (radius(i,j,n,m) >= present_surface) then
               fermi(i,j,n,m) = 0.
            else
               fermi(i,j,n,m) = 1.
            end if
         end do
      end do
   end do
end do
!$omp end parallel do
A very possible reason is the stack limit. First run ulimit -s to check the stack limit for the process. You can use ulimit -s unlimited to set it to unlimited. Then, if it still crashes, try to increase the stack for the OpenMP threads by setting the OMP_STACKSIZE environment variable to a huge value, like 100MB.
Intel has a discussion at https://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors. It has more information about stack and heap memory.
I'm modifying an existing Fortran code which uses the OpenMP library. The original version of this code works perfectly in parallel.
I obtain a segmentation fault when a certain variable is accessed during the multi-threaded run (I verified this by setting flags all over the code). This array is declared allocatable, then marked threadprivate, and then allocated, whereas in the original version it's not allocatable and its size is set immediately. I modified this part due to the workplan I was given.
Here is a basic piece of code which reproduces the error. The guilty variable is an array here named "var".
program testparallel
   use omp_lib
   implicit none
   integer :: thread_id, thread_num
   integer :: i,N
   integer,dimension(:),allocatable,save :: var
   !$omp threadprivate(var)

   N = 20
   allocate(var(5))

   !$omp parallel default(shared) private(thread_id)
   thread_id = omp_get_thread_num()
   thread_num = omp_get_num_threads()
   write(*,*)'Parallel execution on ',thread_num, ' Threads'
   !$omp do
   do i=1,N
      var = 0
      write(*,*) thread_id,i
   end do
   !$omp end do
   !$omp end parallel
end program testparallel
This is more or less how the original code is structured; I didn't modify this part directly. var is initialised within the loop and, according to the inputs, its values are used later by other routines.
This is the error traceback I obtained:
Parallel execution on 2 Threads
0 1
0 2
Parallel execution on 2 Threads
0 3
0 4
0 5
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0 6
Backtrace for this error:
0 7
0 8
0 9
0 10
#0 0x7F0149194697
#1 0x7F0149194CDE
#2 0x7F014824E33F
#3 0x400FB2 in MAIN__._omp_fn.0 at testparallel.F90:?
#4 0x7F0148C693C4
#5 0x7F01485ECDD4
#6 0x7F0148315F6C
#7 0xFFFFFFFFFFFFFFFF
The segfault doesn't occur if I don't define var as allocatable but define its size straightaway (as in the original code). If I allocate it before setting it as threadprivate, I get a compilation error.
How can I avoid this error but keep var as allocatable (which is necessary)?
EDIT: I corrected the description of the original code.
Your issue comes from the fact that, although your allocatable array var is declared threadprivate, it is only allocated in the non-parallel part of the code. Therefore, once in a parallel section, only the master thread can safely access the array.
A very simple fix is to enclose your array allocation (and subsequent de-allocation) within a parallel section like this:
!$omp parallel
allocate(var(5))
!$omp end parallel
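For reference, here is a sketch of how the small test program above might look with that fix applied (the deallocation at the end is an addition for symmetry, not something the original code necessarily needs):

program testparallel
   use omp_lib
   implicit none
   integer :: thread_id, thread_num
   integer :: i, N
   integer, dimension(:), allocatable, save :: var
   !$omp threadprivate(var)

   N = 20

   ! allocate inside a parallel region so that every thread's
   ! private copy of var gets its own allocation
   !$omp parallel
   allocate(var(5))
   !$omp end parallel

   !$omp parallel default(shared) private(thread_id)
   thread_id = omp_get_thread_num()
   thread_num = omp_get_num_threads()
   write(*,*) 'Parallel execution on ', thread_num, ' Threads'
   !$omp do
   do i = 1, N
      var = 0
      write(*,*) thread_id, i
   end do
   !$omp end do

   ! each thread frees its own copy
   deallocate(var)
   !$omp end parallel
end program testparallel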
I am writing a Fortran program that needs to have reproducible results (for publication). My understanding of the following code is that it should be reproducible.
program main
   implicit none
   real(8) :: ybest,xbest,x,y
   integer :: i
   ybest = huge(0d0)
   !$omp parallel do ordered private(x,y) shared(ybest,xbest) schedule(static,1)
   do i = 1,10
      !$omp ordered
      !$omp critical
      call random_number(x)
      !$omp end critical
      !$omp end ordered

      ! Do a lot of work
      call sleep(1)
      y = -1d0

      !$omp ordered
      !$omp critical
      if (y<ybest) then
         ybest = y
         xbest = x
      end if
      !$omp end critical
      !$omp end ordered
   end do
   !$omp end parallel do
end program
In my case, there is a function in place of "sleep" that takes a long time to compute, and I want it done in parallel. According to the OpenMP standard, should sleep in this example execute in parallel? I thought it should (based on this: How does the omp ordered clause work?), but with gfortran 5.2.0 (mac) and gfortran 5.1.0 (linux) it is not executing in parallel (at least, there is no speedup from it). The timing results are below.
Also, my guess is the critical statements are not necessary, but I wasn't completely sure.
Thanks.
-Edit-
In response to Vladmir's comments, I added a full working program with timing results.
#!/bin/bash
mpif90 main.f90
time ./a.out
mpif90 main.f90 -fopenmp
time ./a.out
The code runs as
real 0m10.047s
user 0m0.003s
sys 0m0.003s
real 0m10.037s
user 0m0.003s
sys 0m0.004s
BUT, if you comment out the ordered blocks, it runs with the following times:
real 0m10.044s
user 0m0.002s
sys 0m0.003s
real 0m3.021s
user 0m0.002s
sys 0m0.004s
Edit -
In response to innoSPG, here are the results for a non-trivial function in place of sleep:
real(8) function f(x)
   implicit none
   real(8), intent(in) :: x
   ! local
   real(8) :: tmp
   integer :: i
   tmp = 0d0
   do i = 1,10000000
      tmp = tmp + cos(sin(x))/real(i,8)
   end do
   f = tmp
end function
real 0m2.229s --- no openmp
real 0m2.251s --- with openmp and ordered
real 0m0.773s --- with openmp but ordered commented out
This program is non-conforming to the OpenMP standard. Specifically, the problem is that you have more than one ordered region and every iteration of your loop will execute both of them. The OpenMP 4.0 standard has this to say (2.12.8, Restrictions, line 16, p 139):
During execution of an iteration of a loop or a loop nest within a loop region, a thread must not execute more than one ordered region that binds to the same loop region.
If you have more than one ordered region, you must have conditional code paths such that only one of them can be executed for any loop iteration.
It is also worth noting that the position of your ordered region seems to have performance implications. Testing with gfortran 5.2, it appears that everything after the ordered region is executed in order for each loop iteration, so having the ordered block at the beginning of the loop leads to serial performance, while having the ordered block at the end of the loop does not, as the code before the block is parallelized. Testing with ifort 15 is not as dramatic, but I would still recommend structuring your code so that your ordered block occurs after any code that needs parallelization in a loop iteration, rather than before.
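For what it's worth, one conforming restructuring (just a sketch, not necessarily the right fit for the real problem) is to draw all the random numbers up front, outside the parallel loop, and keep a single ordered region at the end of each iteration for the deterministic update:

program main
   implicit none
   real(8) :: ybest, xbest, y
   real(8), dimension(10) :: x
   integer :: i

   ybest = huge(0d0)

   ! Draw every random number sequentially, before the parallel loop,
   ! so the sequence does not depend on thread scheduling
   call random_number(x)

   !$omp parallel do ordered private(y) shared(x,ybest,xbest) schedule(static,1)
   do i = 1, 10
      ! Expensive work runs in parallel (sleep stands in for the real function)
      call sleep(1)
      y = -1d0

      ! The single ordered region in each iteration: updates happen
      ! in iteration order, so the final result is reproducible
      !$omp ordered
      if (y < ybest) then
         ybest = y
         xbest = x(i)
      end if
      !$omp end ordered
   end do
   !$omp end parallel do
end program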
I am trying to modify legacy code to initialize an array with OpenMP. However, I encounter a segmentation fault when enabling the $OMP DO directives in the following code section. Could you please point out what might be wrong?
I am using Fortran, compiling with gfortran, and the variables are declared as common variables:
common/quant/keosc,vosc,rosc,frt,grt,dipole,v_solv
common/quant_avg/frt_avg,grt_avg,d_coup,rv_avg,b_avg

!$OMP PARALLEL
!$OMP DO private(m,j,l,mp) firstprivate(nstates,natoms) lastprivate(rv_avg,b_avg,grt_avg,frt_avg,d_coup)
do m = 0, nstates - 1
   rv_avg(m) = 0d0
   b_avg(m) = 0d0
   do j = 1, 3
      grt_avg(m,j) = 0d0
      do l = 1, natoms
         frt_avg(m,l,j) = 0d0
         do mp = 0, nstates - 1
            d_coup(m,mp,l,j) = 0d0
         enddo
      enddo
   enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
Have you measured where the CPU consumption is in your program? It is a waste of effort to speed up portions that don't consume much CPU time. I'd be surprised if array initializations were a high fraction of the CPU usage. The code would be more readable if instead you used array notation, e.g., rv_avg (0:nstates - 1) = 0d0.
You haven't shown your declaration of the dimensions of any of the arrays so I speculate that the lines
do m = 0, nstates - 1
rv_avg(m) = 0d0
write to a non-existent element of rv_avg, that is the element at index 0. Since Fortran programs don't, by default, check that array element accesses are within bounds, this write outside the bounds won't be caught by the run-time. If the write stays within the address space of the program when it executes it won't cause a segmentation fault. Given the common block declarations the 0-th element of rv_avg may well be part of d_coup.
Shake up the mapping of variables to address space by introducing OpenMP and it's easy to believe that the 0-th element of rv_avg now lies outside the address space for a thread and causes the segmentation fault.
Since the program makes other references to array elements at 0 any one of them might be at the root of the segmentation fault.
Of course, if you follow #M.S.B.'s advice and use array syntax notation you can avoid out-of-bounds array accesses.
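As a sketch of that array-notation suggestion (the bounds below simply mirror the loop limits in the question and are otherwise an assumption):

! whole-array sections replace the nested loops; if the arrays are declared
! with exactly these bounds, plain assignments like rv_avg = 0d0 are simpler still
rv_avg(0:nstates-1) = 0d0
b_avg(0:nstates-1) = 0d0
grt_avg(0:nstates-1, 1:3) = 0d0
frt_avg(0:nstates-1, 1:natoms, 1:3) = 0d0
d_coup(0:nstates-1, 0:nstates-1, 1:natoms, 1:3) = 0d0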
The problem is probably that you do not have enough stack space in the OpenMP threads to hold the private copies of all these arrays. Especially d_coup looks like a really big one having 3 x natoms x nstates^2 elements. Most Fortran compilers nowadays automatically resort to using heap allocation for such big arrays but when it comes to (first|last)private variables, some OpenMP compilers, including GCC and Intel Fortran Compiler, always place them on the stack. See my answer here for more information.
Edit: Now I see that M. S. B. has actually linked to that same question in his comment.
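Following that point, a sketch of the directive with only the loop indices private and the arrays left shared (they keep their declarations in the common blocks, so no huge private copies land on the thread stacks; each element is written by exactly one thread, so shared is safe here, and whether index 0 is actually valid depends on the declarations, as noted above):

!$OMP PARALLEL DO private(m,j,l,mp)
do m = 0, nstates - 1
   rv_avg(m) = 0d0
   b_avg(m) = 0d0
   do j = 1, 3
      grt_avg(m,j) = 0d0
      do l = 1, natoms
         frt_avg(m,l,j) = 0d0
         do mp = 0, nstates - 1
            d_coup(m,mp,l,j) = 0d0
         enddo
      enddo
   enddo
enddo
!$OMP END PARALLEL DO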
I have two do-loops inside an OpenMP parallel region as follows:
!$OMP PARALLEL
...
!$OMP DO
...
!$OMP END DO
...
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
Let's say OMP_NUM_THREADS=6. I want to run the first do-loop with 4 threads and the second do-loop with 3 threads. Can you show how to do it? I want them to be inside one parallel region, though. Also, is it possible to specify which thread numbers should execute each of the do-loops? For example, for the first do-loop I could ask it to use thread numbers 1, 2, 4, and 5. Thanks.
Well, you can add the num_threads clause to an OpenMP parallel directive, but that sets the thread count for the whole region, not for individual constructs inside it. In your case you could split your program into two regions, like
!$OMP PARALLEL DO num_threads(4)
...
!$OMP END PARALLEL DO
...
!$OMP PARALLEL DO num_threads(3)
...
!$OMP END PARALLEL DO
This, of course, is precisely what you say you don't want to do: you want only one parallel region. But there is no mechanism for throttling the number of threads in use inside a parallel region. Personally, I can't see why anyone would want to do that.
As for assigning parts of the computation to particular threads: again, no, OpenMP does not provide a mechanism for doing that, and why would you want to?
I suppose that I am dreadfully conventional, but when I see signs of parallel programs where the programmer has tried to take precise control over individual threads, I usually see a program with one or more of the following characteristics:
OpenMP directives are used to ensure that the code runs in serial with the result that run time exceeds that of the original serial code;
the program is incorrect because the programmer has failed to deal correctly with the subtleties of data races;
it has been carefully arranged to run only on a specific number of threads.
None of these is desirable in a parallel program and if you want the level of control over numbers of threads and the allocation of work to individual threads you will have to use a lower-level approach than OpenMP provides. Such approaches abound so giving up OpenMP should not limit you.
What you want cannot be achieved with the existing OpenMP constructs but only manually. Imagine that the original parallel loop was:
!$OMP DO
DO i = 1, 100
   ...
END DO
!$OMP END DO
The modified version with custom selection of the participating threads would be:
USE OMP_LIB

INTEGER, DIMENSION(:), ALLOCATABLE :: threads
INTEGER :: tid, i, imin, imax, tidx

! IDs of threads that should execute the loop
! Make sure no repeated items inside
threads = (/ 0, 1, 4 /)

IF (MAXVAL(threads, 1) >= omp_get_max_threads()) THEN
   STOP 'Error: insufficient number of OpenMP threads'
END IF

!$OMP PARALLEL PRIVATE(tid,i,imin,imax,tidx)
! Get current thread's ID
tid = omp_get_thread_num()

...

! Check if current thread should execute part of the loop
IF (ANY(threads == tid)) THEN
   ! Find out what the thread's index is
   tidx = MAXLOC(threads, 1, threads == tid)

   ! Compute iteration range based on the thread index
   imin = 1 + ((100-1 + 1)*(tidx-1))/SIZE(threads)
   imax = 1 + ((100-1 + 1)*tidx)/SIZE(threads) - 1

   PRINT *, 'Thread', tid, imin, imax
   DO i = imin, imax
      ...
   END DO
ELSE
   PRINT *, 'Thread', tid, 'not taking part'
END IF

! This simulates the barrier at the end of the worksharing construct
! Remove in order to implement the "nowait" clause
!$OMP BARRIER

...
!$OMP END PARALLEL
Here are three example executions:
$ OMP_NUM_THREADS=2 ./custom_loop.x | sort
STOP Error: insufficient number of OpenMP threads
$ OMP_NUM_THREADS=5 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
$ OMP_NUM_THREADS=7 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
Thread 5 not taking part
Thread 6 not taking part
Note that this is an awful hack and goes against the basic premises of the OpenMP model. I would strongly advise against doing it and relying on certain threads to execute certain portions of the code as it creates highly non-portable programs and hinders runtime optimisations.
If you decide to abandon the idea of explicitly assigning the threads that should execute the loop and only want to dynamically change the number of threads, then the chunk size parameter in the SCHEDULE clause is your friend:
!$OMP PARALLEL
...
! 2 threads = 10 iterations / 5 iterations/chunk
!$OMP DO SCHEDULE(static,5)
DO i = 1, 10
   PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
! 10 threads = 10 iterations / 1 iteration/chunk
!$OMP DO SCHEDULE(static,1)
DO i = 1, 10
   PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
!$OMP END PARALLEL
And the output with 10 threads:
$ OMP_NUM_THREADS=10 ./loop_chunks.x | sort_manually :)
First loop
Iteration Thread ID
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
Second loop
Iteration Thread ID
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
10 9
I have this sequential code in Fortran. My problem is that when I add OpenMP directives, the parallelized code is slower than the sequential code, and I don't see the error.
REAL, DIMENSION(:), ALLOCATABLE :: current, next
ALLOCATE ( current(TOTAL_Z), next(TOTAL_Z) )

CALL CPU_TIME(t1)

!$OMP PARALLEL SHARED (current, next) PRIVATE (z)
DO t = 1, TOTAL_TIME
   !$OMP DO SCHEDULE(STATIC, 2)
   DO z = 2, (TOTAL_Z - 1)
      next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
   END DO
   !$OMP END DO
   current = next
END DO
CALL CPU_TIME(t2)
!$OMP END PARALLEL
TOTAL_Z, TOTAL_TIME, KAPPA, DELTA_T, DELTA_Z are constants.
When I run the parallelized code, I see in htop that my 2 cores are working at 100%.
In the sequential code, CPU_TIME gives 79 s, and in the parallelized code it gives 132 s.
Thanks
I've just been experiencing the same problem.
It seems that cpu_time() is not suitable for measuring the performance of multi-threaded code: cpu_time() adds up the time of all the threads, which is likely to increase with an increasing number of threads.
I've found this in another forum,
http://software.intel.com/en-us/forums/topic/281897
You should use the system_clock() or omp_get_wtime() functions to get more accurate timing of your routine.
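For instance, a minimal sketch of wall-clock timing with omp_get_wtime() (t1 and t2 here are just placeholder variables):

USE OMP_LIB
REAL(8) :: t1, t2

t1 = omp_get_wtime()
!$OMP PARALLEL
! ... the parallel work being timed ...
!$OMP END PARALLEL
t2 = omp_get_wtime()

PRINT *, 'Wall-clock time [s]:', t2 - t1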
It is probably slow because the threads are contending for access to the shared variables. If you can change it to use a reduction, it would likely be faster. But that might not be easy, since the calculation for "current" accesses multiple array elements.
Depending on the number of iterations, you might also be facing a problem with false sharing on the next array. Since the chunk size for the distribution of the DO loop is rather small, the cache line holding next(z), next(z+1), next(z+2), next(z+3), etc. might be thrashing between the L1/L2 caches of the CPU.
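As a sketch of that point (keeping the loop body from the question), dropping the small chunk size and letting the default static schedule give each thread one large contiguous block reduces the chance of neighbouring threads writing to the same cache line:

!$OMP DO SCHEDULE(STATIC)   ! one large contiguous chunk per thread
DO z = 2, (TOTAL_Z - 1)
   next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
END DO
!$OMP END DO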
Cheers,
-michael