Segmentation fault when enabling $OMP DO loop - Fortran

I am trying to modify legacy code to initialize arrays with OpenMP. However, I encounter a segmentation fault when enabling the $OMP DO directives in the following code section. Would you please point out what might be wrong?
I am using Fortran, compiling with gfortran, and the variables are declared in common blocks:
common/quant/keosc,vosc,rosc,frt,grt,dipole,v_solv
common/quant_avg/frt_avg,grt_avg,d_coup,rv_avg,b_avg
!$OMP PARALLEL
!$OMP DO private(m,j,l,mp) firstprivate(nstates,natoms) lastprivate(rv_avg,b_avg,grt_avg,frt_avg,d_coup)
do m = 0, nstates - 1
   rv_avg(m) = 0d0
   b_avg(m) = 0d0
   do j = 1, 3
      grt_avg(m,j) = 0d0
      do l = 1, natoms
         frt_avg(m,l,j) = 0d0
         do mp = 0, nstates - 1
            d_coup(m,mp,l,j) = 0d0
         enddo
      enddo
   enddo
enddo
!$OMP END DO
!$OMP END PARALLEL

Have you measured where the CPU consumption is in your program? It is a waste of effort to speed up portions that don't consume much CPU time. I'd be surprised if array initializations were a high fraction of the CPU usage. The code would be more readable if instead you used array notation, e.g., rv_avg(0:nstates-1) = 0d0.

You haven't shown your declaration of the dimensions of any of the arrays so I speculate that the lines
do m = 0, nstates - 1
rv_avg(m) = 0d0
write to a non-existent element of rv_avg, that is the element at index 0. Since Fortran programs don't, by default, check that array element accesses are within bounds, this write outside the bounds won't be caught by the run-time. If the write stays within the address space of the program when it executes it won't cause a segmentation fault. Given the common block declarations the 0-th element of rv_avg may well be part of d_coup.
Shake up the mapping of variables to the address space by introducing OpenMP, and it's easy to believe that the 0-th element of rv_avg now lies outside the address space for a thread and causes the segmentation fault.
Since the program makes other references to array elements at index 0, any one of them might be at the root of the segmentation fault.
Of course, if you follow M.S.B.'s advice and use array syntax, you can avoid out-of-bounds array accesses.
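For illustration, a minimal sketch of what that array-syntax initialisation could look like (whole-array assignments, so the compiler uses the declared extents and nothing can go out of bounds):
! Whole-array assignments replace the four nested loops above.
rv_avg  = 0d0
b_avg   = 0d0
grt_avg = 0d0
frt_avg = 0d0
d_coup  = 0d0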

The problem is probably that you do not have enough stack space in the OpenMP threads to hold the private copies of all these arrays. d_coup in particular looks like a really big one, having 3 x natoms x nstates^2 elements. Most Fortran compilers nowadays automatically resort to heap allocation for such big arrays, but when it comes to (first|last)private variables, some OpenMP compilers, including GCC and Intel Fortran Compiler, always place them on the stack. See my answer here for more information.
Edit: Now I see that M. S. B. has actually linked to that same question in his comment.
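For illustration, a minimal sketch (not the poster's actual fix) of the loop with the arrays left shared, so no per-thread copies are made at all; the loop indices are the only things that need to be private here:
!$OMP PARALLEL PRIVATE(m,j,l,mp)
!$OMP DO
do m = 0, nstates - 1
   rv_avg(m) = 0d0
   b_avg(m) = 0d0
   do j = 1, 3
      grt_avg(m,j) = 0d0
      do l = 1, natoms
         frt_avg(m,l,j) = 0d0
         do mp = 0, nstates - 1
            d_coup(m,mp,l,j) = 0d0
         enddo
      enddo
   enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
! If private copies really were required, enlarging the per-thread stack
! (e.g. setting the OMP_STACKSIZE environment variable before running)
! is the usual workaround.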

Related

Calling subroutine in parallel environment

I think my problem is related or even identical to the problem described here. But I don't understand what's actually happening.
I'm using OpenMP with the gfortran compiler, and I have the following task: I have a density distribution F(X, Y) on a two-dimensional surface with x-coordinates X and y-coordinates Y. The matrix F has size Nx x Ny.
I now have a set of coordinates Xp(i) and Yp(i) and I need to interpolate the density F onto these points. This problem is made for parallelization.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
   ! Some stuff to be done here
   Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
   ! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
Everything is shared except for i. The function interp2d is doing some simple linear interpolation.
That works fine with one thread but fails with multithreading. I traced the problem down to the hunt-subroutine taken from Numerical Recipes, which gets called by interp2d. The hunt-subroutine basically calculates the index ix such that X(ix) <= Xp(i) < X(ix+1). This is needed to get the starting point for the interpolation.
With multithreading it happens every now and then that one thread gets the correct index ix from hunt, and the thread that calls hunt next gets the exact same index, even though its Xp(i) is not even close to that point.
I can prevent this by using the CRITICAL environment:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
   ! Some stuff to be done here
   !$OMP CRITICAL
   Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
   !$OMP END CRITICAL
   ! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
But this decreases the efficiency. If I use, for example, three threads, I get a load average of 1.5 with the CRITICAL section. Without it I get a load average of 2.75, but wrong results and sometimes even a SIGSEGV runtime error.
What exactly is happening here? It seems to me that all the threads are calling the same hunt-subroutine and if they do it at the same time there is a conflict. Does that make sense?
How can I prevent this?
Combining variable declaration and initialisation in Fortran 90+ has the side effect of giving the variable the SAVE attribute.
integer :: i = 0
is roughly equivalent to:
integer, save :: i
if (first_invocation) then
i = 0
end if
SAVE'd variables retain their value between multiple invocations of the routine and are therefore often implemented as static variables. By the rules governing the implicit data sharing classes in OpenMP, such variables are shared unless listed in a threadprivate directive.
OpenMP mandates that compliant compilers should apply the above semantics even when the underlying language is Fortran 77.
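For illustration, here is a hypothetical stand-in for such a routine (the name find_index, its argument list and the cached index ilast are invented for the example, not the Numerical Recipes source), showing the problematic initialised local and the threadprivate fix:
subroutine find_index(xvals, n, x, ix)
   ! Hypothetical stand-in for the kind of search hunt performs.
   implicit none
   integer, intent(in)  :: n
   real,    intent(in)  :: xvals(n), x
   integer, intent(out) :: ix
   ! "integer :: ilast = 1" would implicitly carry the SAVE attribute and
   ! be shared between OpenMP threads.  Making the SAVE explicit and adding
   ! THREADPRIVATE gives every thread its own copy; simply dropping the
   ! initialiser (an ordinary local) would also remove the problem.
   integer, save :: ilast = 1
   !$omp threadprivate(ilast)
   ! Try the cached position first, then fall back to a full search.
   if (ilast >= 1 .and. ilast < n) then
      if (xvals(ilast) <= x .and. x < xvals(ilast + 1)) then
         ix = ilast
         return
      end if
   end if
   do ix = 1, n - 1
      if (xvals(ix) <= x .and. x < xvals(ix + 1)) exit
   end do
   ix = min(ix, n - 1)
   ilast = ix
end subroutine find_index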

Using OpenMP critical and ordered

I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.
The problem seems to be the reductions. Using OpenMP reductions works and gives the correct results, but they are not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for the test). Therefore I put the reductions inside a CRITICAL section, but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether or not the order of the additions makes any difference, but it should not produce errors this big.
Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.
So my question is: why doesn't a critical section work in this case?
My code is below. All shared variables except np, tm, hm, gam are read-only.
EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the range of the loops; if the pair has already been visited, generate a new one) and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!
TL;DR: The discrepancies in the results were caused by the ordering of the floating point values. Using double precision instead helps.
!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
DO i=1,nd
   !$OMP CRITICAL (main)
   ! Second loop over the data:
   DO j=1,nd
      ! The lag:
      zdis = z(j) - z(i)
      IF (zdis >= 0.0) THEN
         izl =  INT( zdis/dzlag+0.5)
      ELSE
         izl = -INT(-zdis/dzlag+0.5)
      END IF
      ! ---- SNIP ----
      ! Loop over all variograms for this lag:
      DO cur_variogram=1,nvarg
         variogram_type = ivtype(cur_variogram)
         ! Get the head and tail values:
         indx = i+(ivhead(cur_variogram)-1)*maxdim
         vrh = vr(indx)
         indx = j+(ivtail(cur_variogram)-1)*maxdim
         vrt = vr(indx)
         IF (vrh < tmin .OR. vrh >= tmax .OR. vrt < tmin .OR. vrt >= tmax) CYCLE
         ! ----- PROBLEM AREA -------
         np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1) + 1.   ! <-- This never fails
         tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1) + vrt
         hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1) + vrh
         gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
         ! ----- END OF PROBLEM AREA -----
         !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
      END DO
   END DO
   !$OMP END CRITICAL (main)
END DO
!$OMP END DO
!$OMP END DO
!$OMP END PARALLEL
Thanks very much in advance!
If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538, that is a difference of 1 in the least significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit floating-point number only has about 7 decimal digits to play with.
Ordinary floating-point arithmetic is not strictly associative. For real numbers (in the mathematical, not Fortran, sense) (a+b)+c == a+(b+c), but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.
The non-determinism arises because, in using OpenMP, you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that two executions of the same OpenMP program will produce the same-to-the-last-bit results.
I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time, the compiler has to create a parallelised executable whether it is eventually run in parallel or not.
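A quick single-precision Fortran sketch makes the effect visible (program name and values chosen purely for illustration):
program fp_assoc
   implicit none
   real :: a = -1.0e8, b = 1.0e8, c = 1.23456
   ! (a + b) + c gives 1.23456, but a + (b + c) gives 0.0 because
   ! b + c rounds back to 1.0e8 in single precision.
   print *, '(a+b)+c =', (a + b) + c
   print *, 'a+(b+c) =', a + (b + c)
end program fp_assoc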
High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C).
float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
printf("sum %f\n", a+b+c); //output 1.234560
printf("sum %f\n", a+(b+c)); //output 0.000000
But I would like to point out that it is possible to preserve the order in OpenMP. I discussed this here: C++ OpenMP: Split for loop in even chunks static and join data at the end
Edit:
Actually, I confused commutativity and associativity. If you have an operator which is associative but not commutative, then it's possible to preserve the order with OpenMP as I did in the post above. However, IEEE floating point is commutative but NOT associative, so the only thing that can be done is to break IEEE semantics and let it be treated as associative.

Replace do loop with array notation in fortran

I would like to replace the following do loop with Fortran's intrinsic functions and array notation.
do i=2, n
   do j=2, n
      a = b(j) - b(j-1)
      c(i,j) = a*c(i-1,j) + d(i,j)
   end do
end do
However, as c(i,j) depends on c(i-1,j), none of the following attempts worked, because they do not update c(i,j) row by row:
!FORALL(i = 2:n , j = 2:n ) c(i,j)=c(i-1,j)*(b(j)-b(j-1))+d(i,j)
!FORALL(i = 2:n) c(i,2:n)=c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n)
!c(2:n,2:n)=RESHAPE( (/(c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=RESHAPE((/(((b(j)-b(j-1)) *c(i-1,j)+d(i,j) ,j=2,n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=spread(b(2:n)-b(1:n-1),ncopies = n-1,dim=1) * c(1:n-1,2:n) +d(2:n,2:n)
This is the best I can get, but it still has a do loop:
do i=2, n
   c(i,2:n) = c(i-1,2:n)*(b(2:n)-b(1:n-1)) + d(i,2:n)
end do
Can all do loops be replaced by intrinsic functions and array notation, or could this one be replaced somehow?
In my experience, nothing beats the traditional do loop. All these array intrinsics create memory and CPU overhead by copying data to temporary space (usually on the stack), reshaping it, and the like. If you're manipulating large arrays, you may encounter out-of-memory issues with the intrinsic functions.
Your best option is to stick to a 2-d loop that has the indexes correctly laid out:
do i=2, n
   e = c(1:n,i-1)
   do j=2, n
      a = b(j) - b(j-1)
      c(j,i) = a*e(j) + d(j,i)
   end do
end do
By swapping the indices (and making sure your dimension declarations follow), you improve memory locality: Fortran stores arrays column-major, so the c(j,i) and d(j,i) references now travel down a column of memory, whereas the original c(i-1,j) access cut across columns (and produced paging/cache overhead). The previous column is copied into the temporary array e so the inner loop reads it contiguously as well.
I think this will be the fastest....
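For concreteness, a self-contained sketch of that transposed layout (the problem size and data are made up; note that c and d are declared with the old j as the first index):
program transpose_loop
   implicit none
   integer, parameter :: n = 1000
   real :: a
   real :: b(n), e(n)
   real :: c(n, n), d(n, n)   ! first index is the old j
   integer :: i, j

   call random_number(b)
   call random_number(c)
   call random_number(d)

   do i = 2, n
      e = c(:, i-1)            ! contiguous copy of the previously updated column
      do j = 2, n
         a = b(j) - b(j-1)
         c(j, i) = a * e(j) + d(j, i)
      end do
   end do
end program transpose_loop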
Starting with
do i=2, n
do j=2, n
a=b(j)-b(j-1)
c(i,j)=a*c(i-1,j)+d(i,j)
end do
end do
we can quickly eliminate the do loops with some nifty use of SIZE, SPREAD and EOSHIFT:
res = SPREAD(b - EOSHIFT(b,-1),2,SIZE(c,2))*EOSHIFT(c,-1) + d
Aha, it turns out that the error I was receiving (V1) was due to my using RESHAPE rather than SPREAD. I fixed this in the current version (V2), and it compiles and works with both ifort and gfortran.

Stalling at deallocate

My 2D hydro code stalls during the following subroutine (which computes the y-direction flux):
ALLOCATE(W1d(1:my,nFields),q1d(nFields),&
         Wl(1:my,nFields),Wr(1:my,nFields))
PRINT *,"Main loop"
DO i=1,mx
   DO j=1,my
      q1d(1) = qVar(i,j,1,iRho)
      q1d(2) = qVar(i,j,1, iE)
      q1d(3) = qVar(i,j,1, ivy)
      q1d(4) = qVar(i,j,1, ivx)
      CALL Cons2Prim(q1d(:), W1d(j,:))
   ENDDO
   CALL lr_states(grid, W1d, dt, dy, Wl, Wr, dir)
   DO j=1,my
      Flux(i,j,:) = hllc_flux(wl(j,:), wr(j,:))
   ENDDO
   DO j=1,my
      CALL Prim2Cons(Wl(j,:),Ul(i,j,:))
      CALL Prim2Cons(Wr(j,:),Ur(i,j,:))
   ENDDO
ENDDO
PRINT *,"Deallocating"
DEALLOCATE(W1d,q1d,Wl,Wr)
PRINT *,"Returning"
I separated the DEALLOCATE statement into 4 separate statements and found that whichever 2D array came first (W1d, Wl, or Wr) was the cause of the stall. Omitting the DEALLOCATE statement entirely (which should result in automatic deallocation when returning to the main program) also causes a stall. The subroutine for the x-direction flux has the same arrays, is called before this subroutine, and has no problems deallocating them.
Any suggestions?
EDIT: This is run on Fedora 18 and compiled with Intel Fortran 2013.3. It is a parallelized code, but I am running it on a single processor for testing/debugging purposes.
I did three different things and it suddenly started working again. Two of them I do not believe could have done it, while it is possible the third did it. The changes I made:
1. I had defined the bounds of the i and j loops slightly differently, so I made them uniform between the two directional sweeps.
2. I ran make clean and make.
3. I added the -check bounds -check pointers -check uninit flags to the Makefile.
I think the first two did not really do anything. The variable grid in the code above is a 2x2 array that contains the bounds of qVar; in the x-sweep I had defined mx = grid(1,2) - grid(1,1) + 1, and similarly for my, but grid(1,1) is 1, so it does not really make a difference. The second item I had already done at least 3 times.
But the last one I tried once and it started working again. I do not know how that could have fixed it, so if someone does know, please tell me!

Fortran + OpenMP slower than sequential

I have this sequential code in Fortran. My problem is that when I add OpenMP directives, the parallelized code is slower than the sequential version, and I don't see the error.
REAL, DIMENSION(:), ALLOCATABLE :: current, next
ALLOCATE(current(TOTAL_Z), next(TOTAL_Z))
CALL CPU_TIME(t1)
!$OMP PARALLEL SHARED(current, next) PRIVATE(z)
DO t = 1, TOTAL_TIME
   !$OMP DO SCHEDULE(STATIC, 2)
   DO z = 2, (TOTAL_Z - 1)
      next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
   END DO
   !$OMP END DO
   current = next
END DO
CALL CPU_TIME(t2)
!$OMP END PARALLEL
TOTAL_Z, TOTAL_TIME, KAPPA, DELTA_T, DELTA_Z are constants.
When I run the parallelized code, I see in htop that my 2 cores are working at 100%.
For the sequential code, CPU_TIME reports 79 s; for the parallelized code, 132 s.
Thanks
I've just been experiencing the same problem.
It seems that using cpu_time() is not suitable for measuring the performance of multi-threaded code: cpu_time() adds up the time of all the threads, which is likely to increase with an increasing number of threads.
I've found this in another forum:
http://software.intel.com/en-us/forums/topic/281897
You should use the system_clock() or omp_get_wtime() functions to get a more accurate timing of your routine.
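For illustration, a minimal sketch (with a made-up workload) comparing the two timers; with several threads, cpu_time grows while omp_get_wtime reports the actual elapsed time:
program time_compare
   use omp_lib
   implicit none
   double precision :: wt1, wt2, s
   real    :: ct1, ct2
   integer :: i

   s = 0d0
   call cpu_time(ct1)
   wt1 = omp_get_wtime()
!$OMP PARALLEL DO REDUCTION(+:s)
   do i = 1, 200000000
      s = s + 1d0 / i
   end do
!$OMP END PARALLEL DO
   wt2 = omp_get_wtime()
   call cpu_time(ct2)
   ! cpu_time sums the CPU time of all threads; omp_get_wtime is wall clock.
   print *, 'cpu_time      :', ct2 - ct1
   print *, 'omp_get_wtime :', wt2 - wt1, '   (sum =', s, ')'
end program time_compare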
It is probably slow because the threads are contending for access to the shared variables. If you could change it to use a reduction it would likely be faster, but that might not be easy since the calculation for "current" accesses multiple array elements.
Depending on the number of iterations, you might also be facing a problem with false sharing on the next array. Since the chunk size for the distribution of the DO loop is rather small, the cache line holding next(z), next(z+1), next(z+2), next(z+3), etc. might be thrashing between the L1/L2 caches of the CPUs (a sketch of the kind of change that avoids this follows below).
Cheers,
-michael
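For illustration, a self-contained sketch (made-up problem size and initial data, not the poster's code) combining the suggestions above: wall-clock timing with omp_get_wtime, the default STATIC schedule so each thread owns one large contiguous block of next, and the current = next copy done by a single thread inside the parallel region:
program heat1d
   use omp_lib
   implicit none
   integer, parameter :: TOTAL_Z = 1000000, TOTAL_TIME = 1000
   real,    parameter :: KAPPA = 1.0e-3, DELTA_T = 1.0e-3, DELTA_Z = 1.0e-2
   real, allocatable  :: current(:), next(:)
   double precision   :: t1, t2
   integer :: z, t

   allocate(current(TOTAL_Z), next(TOTAL_Z))
   current = 0.0
   current(TOTAL_Z/2) = 1.0          ! arbitrary initial condition
   next = current

   t1 = omp_get_wtime()
!$OMP PARALLEL PRIVATE(z, t) SHARED(current, next)
   do t = 1, TOTAL_TIME
      ! Default STATIC schedule: one big contiguous chunk per thread,
      ! so threads only share cache lines at chunk boundaries.
      !$OMP DO SCHEDULE(STATIC)
      do z = 2, TOTAL_Z - 1
         next(z) = current(z) + KAPPA*DELTA_T* &
                   ((current(z-1) - 2.0*current(z) + current(z+1)) / DELTA_Z**2)
      end do
      !$OMP END DO
      ! One thread copies; the implied barriers keep the time steps in order.
      !$OMP SINGLE
      current = next
      !$OMP END SINGLE
   end do
!$OMP END PARALLEL
   t2 = omp_get_wtime()
   print *, 'wall time (s):', t2 - t1
end program heat1d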