I'm modifying an existing Fortran code which uses the OpenMP library. The original version of this code works perfectly in parallel.
I obtain a segmentation fault when a certain variable is accessed during the multi-threaded run (I verified this by setting flags all over the code). This array is declared allocatable, then threadprivate, and then allocated, while in the original version it is not allocatable and its size is set immediately. I modified this part due to the workplan I was given.
Here is a basic piece of code which reproduces the error. The guilty variable is an array here named "var".
program testparallel
use omp_lib
implicit none
integer :: thread_id, thread_num
integer :: i,N
integer,dimension(:),allocatable,save :: var
!$omp threadprivate(var)
N = 20
allocate(var(5))
!$omp parallel default(shared) private(thread_id)
thread_id = omp_get_thread_num()
thread_num = omp_get_num_threads()
write(*,*)'Parallel execution on ',thread_num, ' Threads'
!$omp do
do i=1,N
var = 0
write(*,*) thread_id,i
end do
!$omp end do
!$omp end parallel
end program testparallel
This is how more or less the original code is structured, I didn't modify this part directly. var is initialised within the loop and, according to the inputs, its values are used later by other routines.
This is the error traceback I obtained:
Parallel execution on 2 Threads
0 1
0 2
Parallel execution on 2 Threads
0 3
0 4
0 5
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0 6
Backtrace for this error:
0 7
0 8
0 9
0 10
#0 0x7F0149194697
#1 0x7F0149194CDE
#2 0x7F014824E33F
#3 0x400FB2 in MAIN__._omp_fn.0 at testparallel.F90:?
#4 0x7F0148C693C4
#5 0x7F01485ECDD4
#6 0x7F0148315F6C
#7 0xFFFFFFFFFFFFFFFF
The segfault doesn't occur if I don't define var as allocatable but define its size straightaway (as in the original code). If I allocate it before setting it as threadprivate I get a compilation error.
How can I avoid this error but keep var as allocatable (which is necessary)?
EDIT: I corrected the description of the original code.
Your issue comes from the fact that, although your allocatable array var is declared threadprivate, it is only allocated in the serial part of the code. Therefore, once inside a parallel region, only the master thread can safely access the array; the copies belonging to the other threads were never allocated.
A very simple fix is to enclose your array allocation (and the subsequent deallocation) within a parallel region like this:
!$omp parallel
allocate(var(5))
!$omp end parallel
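For completeness, here is a minimal sketch of how the whole test program might look with that fix applied. The program name and the call to omp_set_dynamic are my additions, not part of the original code; the sketch assumes the number of threads stays the same across the parallel regions, since that is what lets the threadprivate copies persist from one region to the next.

```fortran
program testparallel_fixed   ! hypothetical name for the corrected test program
  use omp_lib
  implicit none
  integer, parameter :: n = 20
  integer :: i, thread_id
  integer, dimension(:), allocatable, save :: var
  !$omp threadprivate(var)

  ! Assumption: keep the team size fixed so threadprivate data persists
  ! between consecutive parallel regions.
  call omp_set_dynamic(.false.)

  ! Every thread allocates its own copy of var.
  !$omp parallel
  allocate(var(5))
  !$omp end parallel

  !$omp parallel default(shared) private(thread_id)
  thread_id = omp_get_thread_num()
  !$omp do
  do i = 1, n
     var = 0                 ! each thread writes to its own copy
     write(*,*) thread_id, i
  end do
  !$omp end do
  !$omp end parallel

  ! Every thread releases its own copy.
  !$omp parallel
  deallocate(var)
  !$omp end parallel
end program testparallel_fixed
```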
I had a serial code where I would declare a bunch of variables in modules and then use those modules across the rest of my program and subroutines. Now I am trying to parallelize this code. There is a portion of the code that I want to run in parallel which seems to be working except for one array, gtmp. I want each thread to have its own version of gtmp and I want that version to be private to its respective thread, so I've used the threadprivate directive. gtmp is only used inside the parallel region of the code or within subroutines that are only called from the parallel part of the code.
At first I allocated gtmp in a serial portion of the code before the parallel portion, but that was an issue because then only the master thread 'version' of gtmp got allocated and the other thread 'versions' of gtmp had a size of 1 rather than the expected allocated size of gtmp, (this was shown by the "test" print statement). I think this happened because the master thread is the only thread executing code in the serial portions. So, I moved the allocate line into the parallel region, which allowed all threads to have appropriately sized/allocated gtmp arrays, but since my parallel region is inside a loop I get an error when the program tries to allocate gtmp a second time in the second iteration of the r loop.
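A small check along the following lines can make that behaviour visible. It is only a sketch: the program and variable names are hypothetical, and it simply reports the allocation status of a threadprivate array on each thread after a serial allocate.

```fortran
program check_threadprivate_alloc   ! hypothetical diagnostic program
  use omp_lib
  implicit none
  integer, allocatable, save :: var(:)
  !$omp threadprivate(var)

  ! Serial part: only the initial (master) thread's copy is allocated here.
  allocate(var(5))

  !$omp parallel
  ! The critical section only keeps the output lines from interleaving.
  !$omp critical
  print *, 'thread', omp_get_thread_num(), 'allocated:', allocated(var)
  !$omp end critical
  !$omp end parallel
end program check_threadprivate_alloc
```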
Note: elsewhere in the code all the other variables in mymod are given values.
Here is a simplified portion of the code that is having the issue:
module mymod
integer :: xBins, zBins, rBins, histCosThBins, histPhiBins, cfgRBins
real(kind=dp),allocatable :: gtmp(:,:,:)
end module mymod
subroutine compute_avg_force
use mymod
implicit none
integer :: r, i, j, ip
integer :: omp_get_thread_num, tid
! I used to allocate 'gtmp' here.
do r = 1, cfgRBins
!$omp PARALLEL DEFAULT( none ) &
!$omp PRIVATE( ip, i, j, tid ) &
!$omp SHARED( r, xBins, zBins, histCosThBins, histPhiBins )
allocate( gtmp(4,0:histCosThBins+1,0:histPhiBins+1) )
tid = omp_get_thread_num() !debug
print*, 'test', tid, histCosThBins, histPhiBins, size(gtmp)
!$omp DO SCHEDULE( guided )
do ip = 1, (xBins*zBins)
call subroutine_where_i_alter_gtmp(...)
...code to be executed in parallel using gtmp...
end do !ip
!$omp END DO
!$omp END PARALLEL
end do !r
end subroutine compute_avg_force
So, the issue is coming from the fact that I need all threads to be active (i.e. in a parallel region) to appropriately initialize all 'versions' of gtmp, but my parallel region is inside a loop and I can't allocate gtmp more than once.
In short, what is the correct way to allocate gtmp in this code? I've thought that I could just make another omp parallel region before the loop and use that to allocate gtmp but that seems clunky so I'm wondering what the "right" way to do something like this is.
Thanks for the help!
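One possible approach, sketched here under assumptions rather than offered as the definitive answer, is to guard the allocation with allocated() so that each thread allocates its copy only the first time the region is entered. The module name, the bin counts, the loop bounds and the dp definition below are hypothetical stand-ins; the sketch also assumes the same number of threads is used in every parallel region so that the threadprivate copies persist.

```fortran
module mymod_sketch                       ! hypothetical stand-in for mymod
  implicit none
  integer, parameter :: dp = kind(1.0d0)  ! assumed definition of dp
  real(dp), allocatable, save :: gtmp(:,:,:)
  !$omp threadprivate(gtmp)
end module mymod_sketch

program guarded_allocation
  use omp_lib
  use mymod_sketch
  implicit none
  integer, parameter :: nct = 3, nphi = 2  ! hypothetical bin counts
  integer :: r, ip

  call omp_set_dynamic(.false.)  ! assumption: fixed team size

  do r = 1, 3
     !$omp parallel
     ! Allocate this thread's copy only on the first pass through the r loop.
     if (.not. allocated(gtmp)) allocate(gtmp(4, 0:nct+1, 0:nphi+1))
     !$omp do schedule(guided)
     do ip = 1, 8
        gtmp = real(ip, dp)      ! placeholder for the real work on gtmp
     end do
     !$omp end do
     !$omp end parallel
  end do

  ! Each thread frees its own copy once the loop is finished.
  !$omp parallel
  if (allocated(gtmp)) deallocate(gtmp)
  !$omp end parallel
end program guarded_allocation
```

The alternative of a separate parallel region before the loop that does nothing but allocate works just as well; the guard merely avoids the extra region.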
I have two do-loops inside OpenMP parallel region as follows:
!$OMP PARALLEL
...
!$OMP DO
...
!$OMP END DO
...
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
Let's say OMP_NUM_THREADS=6. I wanted to run first do-loop with 4 threads, and the second do-loop with 3 threads. Can you show how to do it? I want them to be inside one parallel region though. Also is it possible to specify which thread numbers should do either of the do-loops, for example in case of first do-loop I could ask it to use thread numbers 1,2,4, and 5. Thanks.
Well, you can add the num_threads clause to an OpenMP parallel directive, but it then applies to the whole region, including every worksharing construct inside it. In your case you could split your program into two regions, like
!$OMP PARALLEL DO num_threads(4)
...
!$OMP END PARALLEL DO
...
!$OMP PARALLEL DO num_threads(3)
...
!$OMP END PARALLEL DO
This, of course, is precisely what you say you don't want to do, since you want to have only one parallel region. But there is no mechanism for throttling the number of threads in use inside a parallel region. Personally I can't see why anyone would want to do that.
As for assigning parts of the computation to particular threads, again, no, OpenMP does not provide a mechanism for doing that, and why would you want to?
I suppose that I am dreadfully conventional, but when I see signs of parallel programs where the programmer has tried to take precise control over individual threads, I usually see a program with one or more of the following characteristics:
OpenMP directives are used to ensure that the code runs in serial with the result that run time exceeds that of the original serial code;
the program is incorrect because the programmer has failed to deal correctly with the subtleties of data races;
it has been carefully arranged to run only on a specific number of threads.
None of these is desirable in a parallel program and if you want the level of control over numbers of threads and the allocation of work to individual threads you will have to use a lower-level approach than OpenMP provides. Such approaches abound so giving up OpenMP should not limit you.
What you want cannot be achieved with the existing OpenMP constructs but only manually. Imagine that the original parallel loop was:
!$OMP DO
DO i = 1, 100
...
END DO
!$OMP END DO
The modified version with custom selection of the participating threads would be:
USE OMP_LIB
INTEGER, DIMENSION(:), ALLOCATABLE :: threads
INTEGER :: tid, i, imin, imax, tidx
! IDs of threads that should execute the loop
! Make sure no repeated items inside
threads = (/ 0, 1, 4 /)
IF (MAXVAL(threads, 1) >= omp_get_max_threads()) THEN
STOP 'Error: insufficient number of OpenMP threads'
END IF
!$OMP PARALLEL PRIVATE(tid,i,imin,imax,tidx)
! Get current thread's ID
tid = omp_get_thread_num()
...
! Check if current thread should execute part of the loop
IF (ANY(threads == tid)) THEN
! Find out what thread's index is
tidx = MAXLOC(threads, 1, threads == tid)
! Compute iteration range based on the thread index
imin = 1 + ((100-1 + 1)*(tidx-1))/SIZE(threads)
imax = 1 + ((100-1 + 1)*tidx)/SIZE(threads) - 1
PRINT *, 'Thread', tid, imin, imax
DO i = imin, imax
...
END DO
ELSE
PRINT *, 'Thread', tid, 'not taking part'
END IF
! This simulates the barrier at the end of the worksharing construct
! Remove in order to implement the "nowait" clause
!$OMP BARRIER
...
!$OMP END PARALLEL
Here are three example executions:
$ OMP_NUM_THREADS=2 ./custom_loop.x | sort
STOP Error: insufficient number of OpenMP threads
$ OMP_NUM_THREADS=5 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
$ OMP_NUM_THREADS=7 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
Thread 5 not taking part
Thread 6 not taking part
Note that this is an awful hack and goes against the basic premises of the OpenMP model. I would strongly advise against doing it and relying on certain threads to execute certain portions of the code as it creates highly non-portable programs and hinders runtime optimisations.
If you decide to abandon the idea of explicitly assigning the threads that should execute the loop and only want to dynamically change the number of threads, then the chunk size parameter in the SCHEDULE clause is your friend:
!$OMP PARALLEL
...
! 2 threads = 10 iterations / 5 iterations/chunk
!$OMP DO SCHEDULE(static,5)
DO i = 1, 10
PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
! 10 threads = 10 iterations / 1 iteration/chunk
!$OMP DO SCHEDULE(static,1)
DO i = 1, 10
PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
!$OMP END PARALLEL
And the output with 10 threads:
$ OMP_NUM_THREADS=10 ./loop_chunks.x | sort_manually :)
First loop
Iteration Thread ID
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
Second loop
Iteration Thread ID
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
10 9
The number of files that are getting written is always less than the number of threads. Logically for me, when I can have 4 threads and the CPU is working at 400%, I was expecting the number of files to be 4 (one each corresponding to every single thread). I don't know if there is a problem with my code or this is how it is supposed to work. The code is as follows:
!!!!!!!! module
module common
use iso_fortran_env
implicit none
integer,parameter:: dp=real64
real(dp):: aa,bb
contains
subroutine evolve(y,yevl)
implicit none
integer(dp),parameter:: id=2
real(dp),intent(in):: y(id)
real(dp),intent(out):: yevl(id)
yevl(1)=y(2)+1.d0-aa*y(1)**2
yevl(2)=bb*y(1)
end subroutine evolve
end module common
use common
implicit none
integer(dp):: iii,iter,i
integer(dp),parameter:: id=2
real(dp),allocatable:: y(:),yt(:)
integer(dp):: OMP_GET_THREAD_NUM, IXD
allocate(y(id)); allocate(yt(id)); y=0.d0; yt=0.d0; bb=0.3d0
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
!$OMP DO
do iii=1,20000; print*,iii !! EXPECTED THREADS TO BE OF 5000 ITERATIONS EACH
aa=1.d0+dfloat(iii-1)*0.4d0/80000.d0
loop1: do iter=1,10 !! THE INITIAL CONDITION LOOP
call random_number(y)!! RANDOM INITIALIZATION OF THE VARIABLE
loop2: do i=1,70000 !! ITERATION OF THE SYSTEM
call evolve(y,yt)
y=yt
enddo loop2 !! END OF SYSTEM ITERATION
write(IXD+1,*)aa,yt !!! WRITING FILE CORRESPONDING TO EACH THREAD
enddo loop1 !!INITIAL CONDITION ITERATION DONE
enddo
!$OMP ENDDO
!$OMP END PARALLEL
end
Is this behavior resulting from some race issue in the code? The code compiles and executes just fine without any warnings or errors with ifort version 13.1.0 on ubuntu. Thanks a bunch for any comments or suggestions.
The variable IXD should be explicitly declared as private to make sure every thread has its own copy of it. Otherwise it is shared, and every thread overwrites it with its own thread number. Changing the line(s)
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
to
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt,ixd) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
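solves the problem. Note that the module variable aa is also shared by default and is assigned by every thread inside the loop, so it is subject to a similar race; declaring it threadprivate or passing it to evolve as an argument would avoid that.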
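As an illustration of the per-thread output idea, here is a minimal sketch in which the thread number, the unit number and the file name are all private. The unit offset of 10 and the thread_N.dat file names are my own choices for the example, not taken from the original code.

```fortran
program per_thread_files   ! hypothetical example program
  use omp_lib
  implicit none
  integer :: i, tid, u
  character(len=32) :: fname

  !$omp parallel private(tid, u, fname)
  tid = omp_get_thread_num()
  u = 10 + tid                                     ! assumed unit numbering
  write(fname, '(a,i0,a)') 'thread_', tid, '.dat'  ! assumed file naming
  open(unit=u, file=trim(fname), status='replace', action='write')

  !$omp do
  do i = 1, 16
     write(u, *) i, tid    ! each thread writes only to its own unit
  end do
  !$omp end do

  close(u)
  !$omp end parallel
end program per_thread_files
```

A thread that receives no iterations still creates its file, but the file stays empty.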
I encounter a problem with OpenMP and shared variables I cannot understand. Everything I do is in Fortran 90/95.
Here is my problem: I have a parallel region defined in my main program, with the clause DEFAULT(SHARED), in which I call a subroutine that does some computation. I have a local variable (an array) I allocate and on which I do the computations. I was expecting this array to be shared (because of the DEFAULT(SHARED) clause), but it seems that it is not the case.
Here is an example of what I am trying to do and that reproduce the error I get:
program main
!$ use OMP_LIB
implicit none
integer, parameter :: nx=10, ny=10
real(8), dimension(:,:), allocatable :: array
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP SINGLE
allocate(array(nx,ny))
!$OMP END SINGLE
!$OMP WORKSHARE
array = 1.
!$OMP END WORKSHARE
call compute(array,nx,ny)
!$OMP SINGLE
deallocate(array)
!$OMP END SINGLE
!$OMP END PARALLEL
contains
!=============================================================================
! SUBROUTINES
!=============================================================================
subroutine compute(array, nx, ny)
!$ use OMP_LIB
implicit none
real(8), dimension(nx,ny) :: array
integer :: nx, ny
real(8), dimension(:,:), allocatable :: q
integer :: i, j
!$OMP SINGLE
allocate(q(nx,ny))
!$OMP END SINGLE
!$OMP WORKSHARE
q = 0.
!$OMP END WORKSHARE
print*, 'q before: ', q(1,1)
!$OMP DO SCHEDULE(RUNTIME)
do j = 1, ny
do i = 1, nx
if(mod(i,j).eq.0) then
q(i,j) = array(i,j)*2.
else
q(i,j) = array(i,j)*0.5
endif
end do
end do
!$OMP END DO
print*, 'q after: ', q(1,1)
!$OMP SINGLE
deallocate(q)
!$OMP END SINGLE
end subroutine compute
!=============================================================================
end program main
When I execute it like that, I get a segmentation fault, because the local array q is allocated on one thread but not on the others, and when the others try to access it in memory, it crashes.
If I get rid of the SINGLE region, the local array q is allocated (though sometimes it crashes, which makes sense if different threads try to allocate it when it is already allocated, and it actually puzzles me why it does not crash every time), but then it is clearly as if the array q is private (therefore one thread returns the expected value, whereas the others return something else).
It really puzzles me why the q array is not shared although I declared my parallel region with the clause DEFAULT(SHARED). And since I am in an orphaned subroutine, I cannot explicitly declare q as shared, since it is known only in the subroutine compute... I am stuck with this problem so far, and I could not find a workaround.
Is it normal? Should I expect this behaviour? Is there a workaround? Do I miss something obvious?
Any help would be highly appreciated!
q is an entity that is "inside a region but not inside a construct" in terms of OpenMP speak. The subroutine that q is local to is in a procedure that is called during a parallel construct, but q itself does not lexically appear in between the PARALLEL and END PARALLEL directives.
The data sharing rules for such entities in OpenMP then dictate that q is private.
The data sharing clauses such as DEFAULT(SHARED), etc only apply to things that appear in the construct itself (things that lexically appear in between the PARALLEL and END PARALLEL). (They can't apply to things in the region generally - procedures called in the region may have been separately compiled and might be called outside of any parallel constructs.)
The array q is defined INSIDE the called subroutine. Every thread calls this subroutine independently and therefore every thread will have its own copy. The DEFAULT(SHARED) clause on the parallel construct in the caller cannot change this. Try declaring q with the save attribute, which makes it shared among the threads.
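A minimal sketch of that suggestion follows. It is simplified from the code in the question: the program name and the dp kind parameter are my additions, and the array is set up serially to keep the example short. Bear in mind that a saved local makes the routine non-reentrant, so this only fits if compute is never executed by two independent teams at once.

```fortran
program shared_local_sketch   ! hypothetical name
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 10, ny = 10
  real(dp), allocatable :: array(:,:)

  allocate(array(nx, ny))
  array = 1.0_dp

  !$omp parallel default(shared)
  call compute(array, nx, ny)
  !$omp end parallel

  deallocate(array)
contains
  subroutine compute(array, nx, ny)
    integer, intent(in) :: nx, ny
    real(dp), intent(in) :: array(nx, ny)
    ! SAVE gives q a single instance that all threads share.
    real(dp), allocatable, save :: q(:,:)
    integer :: i, j

    !$omp single
    allocate(q(nx, ny))
    !$omp end single          ! implicit barrier: q exists for every thread

    !$omp do
    do j = 1, ny
       do i = 1, nx
          if (mod(i, j) == 0) then
             q(i, j) = array(i, j) * 2.0_dp
          else
             q(i, j) = array(i, j) * 0.5_dp
          end if
       end do
    end do
    !$omp end do              ! implicit barrier: q is completely filled

    !$omp single
    print *, 'q(1,1) after: ', q(1, 1)
    deallocate(q)
    !$omp end single
  end subroutine compute
end program shared_local_sketch
```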
I'm writing a matrix multiplication subroutine in Fortran. I'm using the Intel Fortran compiler. I've written a simple static scheduled parallel do-loop. Unfortunately, it's running on only one thread. Here's the code:
SUBROUTINE MATMULT(A,B,C,L,M,N)
REAL*8 A,B,C
INTEGER NCORES, CHUNK, TID
DIMENSION A(L,N),B(L,M),C(M,N)
PARAMETER (NCORES=8)
CHUNK=(L/(NCORES+1))+1
TID=0
!$OMP PARALLELDO SHARED(A,B,C,L,M,N,CHUNK) PRIVATE(I,J,K,TID)
!$OMP+DEFAULT(NONE) SCHEDULE(STATIC,CHUNK)
DO I=1,L
TID = OMP_GET_THREAD_NUM()
PRINT *, "THREAD ", TID, " ON I=", I
DO K=1,N
DO J=1,M
A(I,K) = A(I,K) + B(I,J)*C(J,K)
END DO
END DO
END DO
!$OMP END PARALLELDO
RETURN
END
Note:
There are no parallel directives in the main program that calls the routine
The arrays A,B,C are initialized serially in the main program. A is initialized to zeros
I am enforcing the Fortran fixed source form during compilation
I have confirmed the following:
Another example program works fine with 8 threads (so no hardware issue)
I have used the -openmp compiler argument
OMP_GET_NUM_PROCS() and OMP_GET_MAX_THREADS() both return 0
TID is 0 for every iteration over I (which shouldn't be the case)
I am unable to diagnose my mistake. I'd appreciate any inputs on this.
The identifier OMP_GET_THREAD_NUM is not explicitly declared. The default implicit typing rules mean it will be of type real. That's not consistent with the declaration in the OpenMP spec for the function of that name.
Adding USE OMP_LIB would fix that issue. Further, not using implicit typing (IMPLICIT NONE) would avoid this and a multitude of similar problems.
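A sketch of the routine with both suggestions applied is shown below. It is written in free source form rather than the fixed form used in the question, the driver program and its matrix sizes are hypothetical, and the hand-computed chunk size is replaced by a plain static schedule to keep the example short.

```fortran
program matmult_demo   ! hypothetical driver
  use omp_lib
  implicit none
  integer, parameter :: l = 4, m = 3, n = 2   ! assumed small test sizes
  real(8) :: a(l, n), b(l, m), c(m, n)

  a = 0.0d0
  b = 1.0d0
  c = 2.0d0
  call matmult(a, b, c, l, m, n)
  print *, 'a(1,1) =', a(1, 1)
contains
  subroutine matmult(a, b, c, l, m, n)
    ! omp_get_thread_num is now an integer function from omp_lib,
    ! and IMPLICIT NONE is inherited from the host program.
    integer, intent(in) :: l, m, n
    real(8), intent(inout) :: a(l, n)
    real(8), intent(in) :: b(l, m), c(m, n)
    integer :: i, j, k, tid

    !$omp parallel do default(none) shared(a, b, c, l, m, n) &
    !$omp private(i, j, k, tid) schedule(static)
    do i = 1, l
       tid = omp_get_thread_num()
       print *, 'THREAD ', tid, ' ON I=', i
       do k = 1, n
          do j = 1, m
             a(i, k) = a(i, k) + b(i, j) * c(j, k)
          end do
       end do
    end do
    !$omp end parallel do
  end subroutine matmult
end program matmult_demo
```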