I would like to replace the following do loop with FORTRAN's intrinsic functions and array notations.
do i=2, n
do j=2, n
a=b(j)-b(j-1)
c(i,j)=a*c(i-1,j)+d(i,j)
end do
end do
However, as c(i,j) depends on c(i-1,j) none of following trials worked. Because they do not update c(i,j)
!FORALL(i = 2:n , j = 2:n ) c(i,j)=c(i-1,j)*(b(j)-b(j-1))+d(i,j)
!FORALL(i = 2:n) c(i,2:n)=c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n)
!c(2:n,2:n)=RESHAPE( (/(c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=RESHAPE((/(((b(j)-b(j-1)) *c(i-1,j)+d(i,j) ,j=2,n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=spread(b(2:n)-b(1:n-1),ncopies = n-1,dim=1) * c(1:n-1,2:n) +d(2:n,2:n)
This is the best I can get. But it still has a do loop
do i=2, n
c(i,2:n)=c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n)
end do
Could all do loops be replaced by intrinsic functions and array notation. Or could this one be replaced somehow ?
In my experience, nothing beats the traditional do-loop. All the extension intrinsics create memory and CPU overhead by copying stuff to temporary space (on the stack usually), reshaping and the sort. If you're manipulating large arrays, you may encounter out-of-memory issues with the intrinsic functions.
Your best option is to stick to a 2-d loop that has the indexes correctly laid out:
do i=2, n
e = c(1:n,i-1)
do j=2, n
a=b(j)-b(j-1)
c(j,i)=a*e(j)+d(j,i)
end do
end do
By replacing the indexes (and making sure your dimension declarations follow), you are saving on the memory paging. c(j,i) and d(j,i) references travel column-wise inside memory while c(j,i-1) would have cut across columns (and produced paging overhead). So we copy it to a temporary e array.
I think this will be the fastest....
Starting with
do i=2, n
do j=2, n
a=b(j)-b(j-1)
c(i,j)=a*c(i-1,j)+d(i,j)
end do
end do
we can quickly eliminate the do loops with some nifty use of SIZE, SPREAD and EOSHIFT:
res = SPREAD(b - EOSHIFT(b,-1),2,SIZE(c,2))*EOSHIFT(c,-1) + d
Aha, turns out that the error I was receiving (V1) was due to my using RESHAPE rather than SPREAD. I fixed this in the current version (V2) and it compiles & works with both ifort and gfortran.
Related
I would like to vectorize this code below (just for an example), just assume somehow I should write an array inside an array.
PROGRAM TEST
IMPLICIT NONE
REAL, DIMENSION(2000):: A,B,C !100000
INTEGER, DIMENSION(2000):: E
REAL(KIND=8):: TIME1,TIME2
INTEGER::I
DO I=1, 2000 !Actually only this loop could be vectorized
B(I)=100.00 !by the compiler
C(I)=200.00
E(I)=I
END DO
!Computing computer's running time (start)
CALL CPU_TIME (TIME1)
DO I=1, 2000 !This is the problem, somehow I should put
A(E(I))=B(E(I))*C(E(I)) !an integer array E(I) inside an array
END DO !I would like to vectorize this loop also, but it didn't work
PRINT *, 'Results =', A(2000)
PRINT *, ' '
!Computing computer's running time (finish)
CALL CPU_TIME (TIME2)
PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM TEST
I thought at first time, that compiler could understand what I want which somehow be vectorized like this:
DO I=1, 2000, 4 !Unrolled 4 times
A(E(I))=B(E(I))*C(E(I))
A(E(I+1))=B(E(I+1))*C(E(I+1))
A(E(I+2))=B(E(I+2))*C(E(I+2))
A(E(I+3))=B(E(I+3))*C(E(I+3))
END DO
but I was wrong. I used: gfortran -Ofast -o -fopt-info-optimized Tes.F95 and I got the information that only the first looping was successfully to be vectorized.
Do you have any idea how I could vectorize it? Or can't it be vectorized at all?
If E hase equal values for different I, then you would be manipulating the same elements of A multiple times, in which case the order could matter. (Though not in your case.) Also, if you have multiple index arrays, like E1, E2 and E3, and
DO I=1, 2000
A(E3(I))=B(E1(I))*C(E2(I))
END DO
the order could matter too. So I think this kind of indexing is not in general allowed in parallel loops.
With ifort one can use !DIR$ IVDEP which is "ignore Vector dependence". It only works when E(I) is linear as in the example...
Assuming that one wants to do all the indexes then just replace (E(i)) with (I) and work out the obvious E(I) order later...
I've quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.
The problem seems to be the reductions. Using OpenMP reductions work and give the correct results, but they are not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for the test). Therefore I put the reductions inside a CRITICAL section but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether or not the order of the additions make any difference, but they should not produce errors this big.
Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.
So my question is: why doesn't a critical section work in this case?
My code is below. All shared variables except np, tm, hm, gam are read-only.
EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the of the loops; if they are "visited", generate new ones) and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!
TL;DR: The discrepancies in the results were caused by the ordering of the floating point values. Using double precision instead helps.
!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
DO i=1,nd
!$OMP CRITICAL (main)
! Second loop over the data:
DO j=1,nd
! The lag:
zdis = z(j) - z(i)
IF(zdis >= 0.0) THEN
izl = INT( zdis/dzlag+0.5)
ELSE
izl = -INT(-zdis/dzlag+0.5)
END IF
! ---- SNIP ----
! Loop over all variograms for this lag:
DO cur_variogram=1,nvarg
variogram_type = ivtype(cur_variogram)
! Get the head and tail values:
indx = i+(ivhead(cur_variogram)-1)*maxdim
vrh = vr(indx)
indx = j+(ivtail(cur_variogram)-1)*maxdim
vrt = vr(indx)
IF(vrh < tmin.OR.vrh >= tmax.OR. vrt < tmin.OR.vrt >= tmax) CYCLE
! ----- PROBLEM AREA -------
np(ixl,iyl,izl,1) = np(ixl,iyl,izl,1) + 1. ! <-- This never fails
tm(ixl,iyl,izl,1) = tm(ixl,iyl,izl,1) + vrt
hm(ixl,iyl,izl,1) = hm(ixl,iyl,izl,1) + vrh
gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
! ----- END OF PROBLEM AREA -----
!CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
END DO
END DO
!$OMP END CRITICAL (main)
END DO
!$OMP END DO
!$OMP END PARALLEL
Thanks very much in advance!
If you are using 32-bit floating-point numbers and arithmetic the difference between 84.26539 and 84.26538, that is a difference of 1 in the least-significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit f-p number only has about 7 decimal digits to play with.
Ordinary floating-point arithmetic is not strictly associative. For real (in the mathematical not Fortran sense) numbers (a+b)+c==a+(b+c) but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.
The non-determinism arises because, in using OpenMP you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that 2 executions of the same OpenMP program will produce the same-to-the-last-bit results.
I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.
High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C).
float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
printf("sum %f\n", a+b+c); //output 1.234560
printf("sum %f\n", a+(b+c)); //output 0.000000
But I would like to point out it is possible to preserve order in OpenMP. I discussed this here C++ OpenMP: Split for loop in even chunks static and join data at the end
Edit:
Actually, I confused commutativity and associativity. If you have an operator which is associative but not commuative than it's possible to preserve the order with OpenMP as I did in the post above. However, IEEE floating point is commutative but NOT asssociative so the only thing that came be done is to break IEEE and let it be associative.
I am trying to modifying legacy code to initialize array with openmp. However, I encounter Segmentation fault when enabling $OMP DO derivatives in the following code sections. Would you please point out what might be wrong?
I am using fortran and compile with gfortran and variables are declared as common variables
common/quant/keosc,vosc,rosc,frt,grt,dipole,v_solv
common/quant_avg/frt_avg,grt_avg,d_coup,rv_avg,b_avg
!$OMP PARALLEL
!$OMP DO private(m,j,l,mp) firstprivate(nstates,natoms) lastprivate(rv_avg,b_avg,grt_avg,frt_avg,d_coup)
do m = 0, nstates - 1
rv_avg(m) = 0d0
b_avg(m) = 0d0
do j = 1, 3
grt_avg(m,j) = 0d0
do l = 1, natoms
frt_avg(m,l,j) = 0d0
do mp = 0, nstates - 1
d_coup(m,mp,l,j) = 0d0
enddo
enddo
enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
Have you measured where the CPU consumption is in your program? It is a waste of effort to speed up portions that don't consume much CPU time. I'd be surprised if array initializations were a high fraction of the CPU usage. The code would be more readable if instead you used array notation, e.g., rv_avg (0:nstates - 1) = 0d0.
You haven't shown your declaration of the dimensions of any of the arrays so I speculate that the lines
do m = 0, nstates - 1
rv_avg(m) = 0d0
write to a non-existent element of rv_avg, that is the element at index 0. Since Fortran programs don't, by default, check that array element accesses are within bounds, this write outside the bounds won't be caught by the run-time. If the write stays within the address space of the program when it executes it won't cause a segmentation fault. Given the common block declarations the 0-th element of rv_avg may well be part of d_coup.
Shake up the mapping of variables to address space by introducing OpenMP and it's easy to believe that the0-th element of rv_avg now lies outside the address space for a thread and causes the segmentation fault.
Since the program makes other references to array elements at 0 any one of them might be at the root of the segmentation fault.
Of course, if you follow #M.S.B.'s advice and use array syntax notation you can avoid out-of-bounds array accesses.
The problem is probably that you do not have enough stack space in the OpenMP threads to hold the private copies of all these arrays. Especially d_coup looks like a really big one having 3 x natoms x nstates^2 elements. Most Fortran compilers nowadays automatically resort to using heap allocation for such big arrays but when it comes to (first|last)private variables, some OpenMP compilers, including GCC and Intel Fortran Compiler, always place them on the stack. See my answer here for more information.
Edit: Now I see that M. S. B. has actually linked to that same question in his comment.
My 2D hydro code stalls during the following subroutine (which computes the y-direction flux):
ALLOCATE(W1d(1:my,nFields),q1d(nFields),&
Wl(1:my,nFields),Wr(1:my,nFields))
PRINT *,"Main loop"
DO i=1,mx
DO j=1,my
q1d(1) = qVar(i,j,1,iRho)
q1d(2) = qVar(i,j,1, iE)
q1d(3) = qVar(i,j,1, ivy)
q1d(4) = qVar(i,j,1, ivx)
CALL Cons2Prim(q1d(:), W1d(j,:))
ENDDO
CALL lr_states(grid, W1d, dt, dy, Wl, Wr, dir)
DO j=1,my
Flux(i,j,:) = hllc_flux(wl(j,:), wr(j,:))
ENDDO
DO j=1,my
CALL Prim2Cons(Wl(j,:),Ul(i,j,:))
CALL Prim2Cons(Wr(j,:),Ur(i,j,:))
ENDDO
ENDDO
PRINT *,"Deallocating"
DEALLOCATE(W1d,q1d,Wl,Wr)
PRINT *,"Returning"
I separated the DEALLOCATE statement into 4 separate statements and found that whichever 2D array would come first, W1d, wl, or wr, was the cause of the stall. Ignoring the DEALLOCATE statement (which should produce an automatic deallocate when going back to the main) also causes a stall. The subroutine for the x-direction flux has the same arrays, is called before this subroutine, and has no problems deallocating them.
Any suggestions?
EDIT This is run on Fedora 18 and compiled with Intel Fortran 2013.3. It is a parallelized code, but I am running it on a single processor for testing/debugging purposes.
I did three different things and it suddenly started working again. Two of them I do not believe could have done it, while it is possible the third did it. The changes I made:
I did have the bounds of i and j loops defined slightly differently, so I made it uniform between the two directional sweeps
I ran make clean and make
I added -check bounds -check pointers -check uninit flags to the Makefile
I think the first two did not really do anything. The variable grid in the code above is a 2x2 array that contains the bounds of qVar; in the x-sweep I had defined mx = grid(1,2) - grid(1,1) + 1, similarly for my, but grid(1,1) is 1, so it really does not do much different. The second item above I had done at least 3 times.
But the last one I tried once and it started working again. I do not know how that could have fixed it, so if someone does know, please tell me!
I have a grid with thousands of double precision reals.
It's iterating through, and I need it to stop when it's reached convergence to 3 decimal places.
The target is to have it run as fast as possible, but needs to give the same result every (to 3 dp) every time.
At the minute I'm doing something like this
REAL(KIND=DP) :: TOL = 0.001_DP
DO WHILE(.NOT. CONVERGED)
CONVERGED = .TRUE.
DO I = 1, NUM_POINTS
NEW POTENTIAL = !blah blah blah
IF (CONVERGED) THEN
IF (NEW_POTENTIAL < OLD_POTENTIAL - TOL .OR. NEW_POTENTIAL > OLD_POTENTIAL + TOL) THEN
CONVERGED = .FALSE.
END IF
END IF
OLD_POTENTIAL = NEW POTENTIAL
END DO
END DO
I'm thinking that many IF statements can't be too great for performance. I thought about checking for convergence at the end; finding the average value (summing the whole grid, divide by num_points), and checking if that has converged in the same way as above, but I'm not convinced this will always be accurate.
What is the best way of doing this?
If I understand correctly you've got some kind of time-stepping going on, where you create the values in new_potential by calculations on old_potential. Then make old equal to new and carry on.
You could replace your existing convergence tests with the single statement
converged = all(abs(new_potential - old_potential)<tol)
which might be faster. If the speed of the test is a major concern you could test only every other (or every third or fourth ...) iteration
A few comments:
1) If you used a potential array with 2 planes, instead of an old_ and new_potential, you could transfer new_ into old_ by swapping indices at the end of each iteration. As your code stands there's a lot of data movement going on.
2) While semantically you are right to have a while loop, I'd always use a do loop with a maximum number of iterations, just in case the convergence criterion is never met.
3) In your declaration REAL(KIND=DP) :: TOL = 0.001_DP the specification of DP on the numerical value of TOL is redundant, REAL(KIND=DP) :: TOL = 0.001 is adequate. I'd also make this a parameter, the compiler may be able to optimise its use if it knows that it is immutable.
4) You don't really need to execute CONVERGED = .TRUE. inside the outermost loop, set it before the first iteration -- this will save you a nanosecond or two.
Finally, if your convergence criterion is that every element in the potential array has converged to 3dp then that is what you should test for. It would be relatively easy to construct counterexamples for your suggested averages. However, my concern would be that your system will never converge on every element and that you should be using some matrix norm computation to determine convergence. SO is not the place for a lesson in that topic.
What are the calculations for the convergence criteria? Unless they are worse then the calculations to advance the potential it is probably better to have the IF statement to terminate the loop as soon as possible rather than guess a very large number of iterations to be sure to obtain a good solution.
Re High Performance Mark's suggestion #1, if the copying operation is a significant portion of the run time, you could also use pointers.
The only way to be sure about this stuff is to measure the run time ... Fortran provides intrinsic functions to measure both CPU and clock time. Otherwise you may modify your some portion of you code to make it faster, perhaps making it less easier to understand and possibly introducing a bug, possibly without much improvement in runtime ... if that portion was taking a small amount of the total runtime, no amount of cleverness will can make much difference.
As High Performance Mark says, though the current semantics are elegant, you probably want to guard against an infinite loop. One approach:
PotentialLoop: do i=1, MaxIter
blah
Converged = test...
if (Converged) exit PotentialLoop
blah
end do PotentialLoop
if (.NOT. Converged) write (*, *) "error, did not converge"
I = 1
DO
NEWPOT = !bla bla bla
IF (ABS(NEWPOT-OLDPOT).LT.TOL) EXIT
OLDPOT = NEWPOT
I = MOD(I,NUMPOINTS) + 1
END DO
Maybe better
I = 1
DO
NEWPOT = !bla bla bla
IF (ABS(NEWPOT-OLDPOT).LT.TOL) EXIT
OLDPOT = NEWPOT
IF (I.EQ.NUMPOINTS) THEN
I = 1
ELSE
I = I + 1
END IF
END DO