The index of inner loops, Private or Shared? - fortran

This is my code, I want to make it parallel with OpenMP. I have one main loop to make parallel and some inner loops.
Are the indices of inner loops, like p, i or Li private or shared?
What happend if I do not declare the variables as private or shared?
Do you suggest to use the allocatable variables for this parallel loop?
!$OMP PARALLEL DO
do l = 1,n_rep
do p = 1,n_l - 1
do q = 1,n_l - 1
do r = 1,n_l - 1
Li = (p - 1)*(n_l - 1)**2 + (q - 1)*(n_l - 1) + r
alpha(Li) = pi*rand()
gamma(Li) = pi*rand()
beta(Li) = pi/2*rand()
R_x(1,1) = 1.d0
R_x(1,2) = 0.d0
R_x(1,3) = 0.d0
R_x(2,1) = 0.d0
R_x(2,2) = cos(alpha(Li))
R_x(2,3) = sin(alpha(Li))
R_x(3,1) = 0.d0
R_x(3,2) = -sin(alpha(Li))
R_x(3,3) = cos(alpha(Li))
R_y(1,1) = cos(beta(Li))
R_y(1,2) = 0.d0
R_y(1,3) = -sin(beta(Li))
R_y(2,1) = 0.d0
R_y(2,2) = 1.d0
R_y(2,3) = 0.d0
R_y(3,1) = sin(beta(Li))
R_y(3,2) = 0.d0
R_y(3,3) = cos(beta(Li))
R_z(1,1) = cos(gamma(Li))
R_z(1,2) = sin(gamma(Li))
R_z(1,3) = 0.d0
R_z(2,1) = -sin(gamma(Li))
R_z(2,2) = cos(gamma(Li))
R_z(2,3) = 0.d0
R_z(3,1) = 0.d0
R_z(3,2) = 0.d0
R_z(3,3) = 1.d0
R_xy = matmul(R_x,R_y)
R_xyz = matmul(R_xy,R_z)
do i = 1,n_f - 1
do j = 1,n_f - 1
do k = 1,n_f - 1
Li = (i - 1)*(n_f - 1)**2 + (j - 1)*(n_f - 1) + k
cf_x(i) = x_f(i) + (p - 1)*d_l - x_c(p)
cf_y(j) = y_f(j) + (q - 1)*d_l - y_c(q)
cf_z(k) = z_f(k) + (r - 1)*d_l - z_c(r)
x_rotated = R_xyz(1,1)*cf_x(i) + R_xyz(1,2)*cf_y(j) &
+ R_xyz(1,3)*cf_z(k)
y_rotated = R_xyz(2,1)*cf_x(i) + R_xyz(2,2)*cf_y(j) &
+ R_xyz(2,3)*cf_z(k)
z_rotated = R_xyz(3,1)*cf_x(i) + R_xyz(3,2)*cf_y(j) &
+ R_xyz(3,3)*cf_z(k)
enddo
enddo
enddo
enddo
enddo
enddo
enddo
!$OMP END PARALLEL DO

Personally I would break this problem up a bit.
Size_of_Array = n_l * n_l * n_l
IF(ALLOCATED(Li)) DEALLOCATE( Li )
ALLOCATE( Li (Size_of_Array))
IF(ALLOCATED(Alpha)) DEALLOCATE( Alpha)
ALLOCATE (Alpha (Size_of_Array))
IF(ALLOCATED(Beta)) DEALLOCATE( Beta )
ALLOCATE( Beta (Size_of_Array))
IF(ALLOCATED(Gamma)) DEALLOCATE( Gamma)
ALLOCATE( Gamma (Size_of_Array))
indexer = 0
do l = 1,n_rep
do p = 1,n_l - 1
do q = 1,n_l - 1
do r = 1,n_l - 1
indexer = indexer + 1
Li(Indexer) = (p - 1)*(n_l - 1)**2 + (q - 1)*(n_l - 1) + r
ENDDO
ENDDO
ENDDO
ENDDO
alpha = pi*rand()
gamma = pi*rand()
beta = pi/2*rand()
!?OMP DO PARALLEL
DO I= 1, SIZE(Li)
CALL Make_Array(Alpha(I), Beta(I), Gamma(I), MyArray(:,:,I) )
ENDDO
!etc
Basically moving the array to be inside of either an ELEMENTAL FUNCTION or a PURE SUBROUTINE. Then see what it does for speed with inlining and a single parallel do of some sort (OMP or other).
PURE SUBROUTINE Make_Array(Alpha, Beta, Gamma, MyArray)
IMPLICIT NONE
DOUBLE, INTENT(IN ) :: Alpha
DOUBLE, INTENT(IN ) :: Beta
DOUBLE, INTENT(IN ) :: Gamma
DOUBLE, DIMENSION(3,3) INTENT(INOUT) :: MyArray ! Maybe just intent(OUT)?
R_x(:,:) = 0.d0
R_x(1,1) = 1.d0
R_x(2,2) = cos(alpha)
R_x(2,3) = sin(alpha)
R_x(3,2) = -sin(alpha)
R_x(3,3) = cos(alpha)
R_y(1,1) = cos(beta)
R_y(1,3) = -sin(beta)
R_y(2,1) = 0.d0
R_y(2,2) = 1.d0
R_y(2,3) = 0.d0
R_y(3,1) = sin(beta(Li))
R_y(3,2) = 0.d0
R_y(3,3) = cos(beta(Li))
R_z(1,1) = cos(gamma(Li))
R_z(1,2) = sin(gamma(Li))
R_z(1,3) = 0.d0
R_z(2,1) = -sin(gamma(Li))
R_z(2,2) = cos(gamma(Li))
END SUBROUTINE Make_Array
Etc... For other elemental functions or pure subroutines
R_xy = matmul(R_x,R_y)
R_xyz = matmul(R_xy,R_ ...

Related

Frustrating code gives Bad real numer in item 1 of list input

I am having a nightmare of a time trying to run the following code with the input values below. Now when i run it i get the following error , i read some place that it was due to the fact that my 2,txt file looks messy if i dont save it in a UTF -16 bit, is that something that is possible?
At line 11 of file bndry.f (unit = 5, file = '2.txt')
Fortran runtime error: Bad real number in item 1 of list input
Main code:
IMPLICIT NONE
INTEGER MS,NS,JS,N,I
PARAMETER(MS=50000,NS=50000)
REAL*8 S,K,TAU,SIGMA,R,DELTA,SMIN,SMAX,DTAU,ALPHA,BETA,LAM
REAL*8 V(0:MS),BD(0:NS),T(0:NS)
COMMON/OUTPUT/V,BD,T
OPEN(UNIT=5, FILE='2.f')
WRITE(6,*) S,K,TAU,SIGMA,R,DELTA,DTAU
READ(5,*) S,K,TAU,SIGMA,R,DELTA,DTAU
N = TAU/DTAU
ALPHA = 2.0D0*R/SIGMA**2
BETA = 2.0D0*(R-DELTA)/SIGMA**2
LAM = (BETA-1.0D0) + DSQRT((BETA-1.0D0)**2+4.0D0*ALPHA)
LAM = LAM/2.0D0
SMIN = K/(1.0D0+1.0D0/LAM) ! PERPETUAL BOUNDARY
SMAX = 10.0D0*K
CALL EXP_DIFF(S,K,TAU,SIGMA,R,DELTA,SMAX,SMIN,DTAU,JS)
WRITE(6,*) 'PRICE: ', V(JS)
DO I = 0, N
WRITE(2,10) T(I),BD(I)
ENDDO
10 FORMAT(1X,2F14.8)
STOP
END
C=======================================================================
SUBROUTINE EXP_DIFF(S,K,TAU,SIGMA,R,DELTA,SMAX,SMIN,DTAU,JS)
IMPLICIT NONE
INTEGER MS,NS,JS,M,N,I,J,IEARLY
PARAMETER(MS=50000,NS=50000)
REAL*8 S,K,TAU,SIGMA,R,DELTA,XMIN,XMAX,DTAU,DX,ALPHA,SMIN,SMAX,
& P1,P2,P3,VC,A,B
REAL*8 VE(0:MS),V(0:MS),BD(0:NS),T(0:NS)
COMMON/OUTPUT/V,BD,T
IF (S.GT.SMAX) THEN
STOP 'THE OPTION IS WORHTLESS'
ENDIF
IF (S.LT.SMIN) THEN
STOP 'THE OPTION WORTHS K-S FOR CERTAIN'
ENDIF
XMIN = DLOG(SMIN)
XMAX = DLOG(SMAX)
DX = SIGMA*DSQRT(3.0*DTAU)
JS = (DLOG(S)-XMIN)/DX
DX = (DLOG(S)-XMIN)/FLOAT(JS)
ALPHA = R - DELTA - SIGMA**2/2.0
P1 = SIGMA**2*DTAU/(2.0*DX**2) + ALPHA*DTAU/(2.0*DX)
P2 = 1.0 - SIGMA**2*DTAU/DX**2
P3 = 1.0 - P1 -P2
P1 = P1/(1.0+R*DTAU)
P2 = P2/(1.0+R*DTAU)
P3 = P3/(1.0+R*DTAU)
WRITE(6,*) 'P1,P2,P3',P1,P2,P3
IF (P1.LT.0.0.OR.P2.LT.0.0.OR.P3.LT.0.0) STOP 'DECREASE DTAU'
M = (XMAX-XMIN)/DX
N = TAU/DTAU
IF (M.GT.MS.OR.N.GT.NS) STOP 'INCREASE MS AND NS'
DO J = 0, M
VE(J) = MAX(K-DEXP(J*DX+XMIN),0.0)
V(J) = VE(J)
ENDDO
BD(0) = K
T(0) = 0.0
DO I = 1, N
IEARLY = 0
A = V(M)
B = V(M-1)
DO J = M-1, 1, -1
VC = P1*A+P2*B+P3*V(J-1)
IF (VC.LT.VE(J).AND.IEARLY.EQ.0) THEN
BD(I) = DEXP(XMIN+J*DX)
T(I) = DTAU*DFLOAT(I)
IEARLY = 1
ENDIF
V(J) = MAX(VC, VE(J))
A = B
B = V(J-1)
ENDDO
ENDDO
RETURN
END
Data in 2.f file:
S = 100.0DO
K = 100.0D0
TAU = 3.0D0
SIGMA = 0.2D0
R = 0.08D0
DELTA = 0.04D0
DTAU = 0.03D0
JS is coming out to zero. You then divide by it.
Why is JS an INTEGER?
You're trying to read the names of the variables. The input file works fine like this:
100.0D0
100.0D0
3.0D0
0.2D0
0.08D0
0.04D0
0.03D0
And the last 0 on the first line was a capital O.

Another good ol' segmentation error in fortran

I know that people ask a LOT of segmentation err questions here, but I've put my effort in solving this problem for more than three hour and still wasn't able to solve this. :/ So here is my code:
c sinle event analysis
implicit real(a-h,o-z)
real day(12), nmonth(12), year(12), clas(12),
$ hour(12), nmin(12)
integer mark(12)
real tst(12), D(12), avgP(12,6), avgA(12,6)
integer k, m, n, g
real time(2054904), proa(2054904), prob(2054904), w1(2054904),
$ w2(2054904), w3(2054904), w4(2054904)
D(1) = 31; D(2) = 28; D(3) = 31; D(4) = 30; D(5) = 31;
D(6) = 30; D(7) = 31; D(8) = 31; D(9) = 30; D(10) = 31;
D(11) = 30; D(12) = 31
open(100,file='singleE.dat')
do i=1, 12
tst(i)=0
enddo
900 do i=1, 12
read(100, 1150) day(i), nmonth(i), year(i),
$ hour(i), nmin(i), clas(i)
do j=1, 12
if (int(nmonth(i)).EQ.(13-j)) then
tst(i) = tst(i) + D(12-j)
nmonth(i) = nmonth(i)-1
endif
enddo
tst(i) = tst(i) + day(i) + (year(i) - 2010)*365
$ + (hour(i) + nmin(i)/60)/24
if (year(i) > real(2011)) then
tst(i) = tst(i) + 1
endif
enddo
open(200,file='hole.dat',status='OLD')
k = 0
do i=1, 2054904
read(200,950) time(i), proa(i), prob(i),
$ w1(i), w2(i), w3(i), w4(i)
enddo
mark = 0
do i=1, 12
do j=1, 2054904
k = k + 1
if(abs(tst(i)-time(j))<0.0001) then
mark(i) = k
endif
enddo
enddo
n = 5;
do i= 1, 12
do j= 1,6
avgP(i,j) = 0
avgA(i,j) = 0
enddo
enddo
do i=1, 12
if (mark(i).EQ.0) then
go to 750
endif
do j = (mark(i)-(n+1)*1440), (mark(i)-n*1440)
avgP(i,1) = avgP(i,1) + proa(j)
avgP(i,2) = avgP(i,2) + prob(j)
avgP(i,3) = avgP(i,3) + w1(j)
avgP(i,4) = avgP(i,4) + w2(j)
avgP(i,5) = avgP(i,5) + w3(j)
avgP(i,6) = avgP(i,6) + w4(j)
enddo
do g = (mark(i)+n*1440), (mark(i)+(n+1)*1440)
avgA(i,1) = avgA(i,1) + proa(g)
avgA(i,2) = avgA(i,2) + prob(g)
avgA(i,3) = avgA(i,3) + w1(g)
avgA(i,4) = avgA(i,4) + w2(g)
avgA(i,5) = avgA(i,5) + w3(g)
avgA(i,6) = avgA(i,6) + w4(g)
enddo
750 print *, avgP(i,1), avgP(i,2), avgP(i,3), avgP(i,4),
$ avgP(i,5), avgP(i,6)
enddo
850 close(i)
950 FORMAT(F12.7,2x,E10.3,2x,E10.3,2x,E10.3,2x,E10.3,
$ 2x,E10.3,2x,E10.3)
1150 FORMAT(F2.0,1x,F2.0,1x,F4.0,1x,F2.0,1x,F2.0,4x F3.1)
end
The part that is causing me trouble is the loop here:
do i=1, 12
if (mark(i).EQ.0) then
go to 750
endif
do j = (mark(i)-(n+1)*1440), (mark(i)-n*1440)
avgP(i,1) = avgP(i,1) + proa(j)
avgP(i,2) = avgP(i,2) + prob(j)
avgP(i,3) = avgP(i,3) + w1(j)
avgP(i,4) = avgP(i,4) + w2(j)
avgP(i,5) = avgP(i,5) + w3(j)
avgP(i,6) = avgP(i,6) + w4(j)
enddo
do g = (mark(i)+n*1440), (mark(i)+(n+1)*1440)
avgA(i,1) = avgA(i,1) + proa(g)
avgA(i,2) = avgA(i,2) + prob(g)
avgA(i,3) = avgA(i,3) + w1(g)
avgA(i,4) = avgA(i,4) + w2(g)
avgA(i,5) = avgA(i,5) + w3(g)
avgA(i,6) = avgA(i,6) + w4(g)
enddo
enddo
Using gdb, I found out that the 'j' loop is causing trouble. All of the parameters are fine, but every time I execute the program the 'j' loop goes only once. The peculiar thing is that as 'i' increases the processes start to cripple one by one. for example, at i = 1 the loop executes well. Then, at i = 2, avgP(i,6) = avgP(i,6) + w4(j) causes the seg fault. At i = 3, avgP(i,5) = avgP(i,5) + w3(j) causes the seg fault and in the end at i = 7, entire loop doesn't work. What a strange error! Hope that I can get some help with this.
The loop
do j=1, 12
if (int(nmonth(i)).EQ.(13-j)) then
tst(i) = tst(i) + D(12-j)
nmonth(i) = nmonth(i)-1
endif
enddo
potentially tries to access the value D(0) when j=12 however D is dimensioned 1:12 so who knows what is being written to tst and its consequences.
This should be caught by turning on bounds checking.

OpenMP computation errors

When I enable the OpenMP lines in the following code, I do not often get the correct solutions (so I suspect something is wrong with the parallelization). I have gone through the code over and over again, yet I still could not find where is the problem.
!$OMP PARALLEL SHARED(w, h, u, v, hu, hv, d) &
!$OMP& SHARED(nxw, nyw) &
!$OMP& SHARED(rx, ry, rxg, ryg) &
!$OMP& SHARED(ispans, ispane, jspans, jspane, chunk) &
!$OMP& SHARED(a) &
!$OMP& PRIVATE(i, j, it) &
!$OMP& PRIVATE(ip1, im1, im2, jp1, jm1, jm2, iu1, jv1) &
!$OMP& PRIVATE(ww, hh, uu, vv, hu1, hu2, hv1, hv2, dd, df) &
!$OMP& PRIVATE(xvv, xuu, xve, xue, advx, advy) &
!$OMP& PRIVATE(c, dwdt_i, dwdt_f, dwdt, ref, coef) &
!$OMP& PRIVATE(du, dv, noflux)
ompstart = OMP_GET_WTIME()
do it = 1, itlast
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 2, nyw-1
do i = 2, nxw-1
if (d(i,j) .gt. rpmax) then
ww = w(i,j,1) - rx*(u(i,j,1) - u(i-1,j,1)) &
- ry*(v(i,j,1) - v(i,j-1,1))
if (abs(ww) .lt. eps) ww = 0.0
hh = ww + d(i,j)
if (hh .ge. gx) then
h(i,j,2) = hh
w(i,j,2) = ww
else
h(i,j,2) = 0.0
w(i,j,2) = -d(i,j)
end if
else
h(i,j,2) = 0.0
w(i,j,2) = -d(i,j)
end if
end do
end do
!$OMP END DO
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 1, nyw
do i = 1, nxw-1
hu1 = 0.25*(h(i,j,2) + h(i+1,j,2) + h(i,j,1) + h(i+1,j,1))
hu2 = 0.50*(h(i,j,2) + h(i+1,j,2))
if (hu1 .lt. gx) hu1 = 0.0
if (hu2 .lt. gx) hu2 = 0.0
hu(i,j,1) = hu1
hu(i,j,2) = hu2
end do
end do
!$OMP END DO
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 1, nyw-1
do i = 1, nxw
hv1 = 0.25*(h(i,j,2) + h(i,j+1,2) + h(i,j,1) + h(i,j+1,1))
hv2 = 0.50*(h(i,j,2) + h(i,j+1,2))
if (hv1 .lt. gx) hv1 = 0.0
if (hv2 .lt. gx) hv2 = 0.0
hv(i,j,1) = hv1
hv(i,j,2) = hv2
end do
end do
!$OMP END DO
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do i = 1, nxw-1
ip1 = i+1
im1 = i-1
if (im1 .le. 1) im1 = 1
if (ip1 .ge. nxw-1) ip1 = nxw-1
do j = 1, nyw
noflux = 0
jp1 = j+1
jm1 = j-1
jv1 = j
if (jm1 .le. 1) jm1 = 1
if (jp1 .ge. nyw) jp1 = nyw
if (jv1 .ge. nyw-1) jv1 = nyw-1
du = 0.5*(d(i,j) + d(i+1,j))
if (d(i,j) .le. rpmax .or. du .le. rpmax) then
u(i,j,2) = 0.0
noflux = 1
else
if (h(i,j,2) .gt. gx .and. h(i+1,j,2) .gt. gx) then
dd = hu(i,j,2)
df = hu(i,j,1)
else if (h(i,j,2) .gt. gx .and. h(i+1,j,2) .le. gx &
.and. d(i+1,j) + w(i,j,2) .gt. gx) then
dd = 0.5*h(i,j,2)
df = dd
else if (h(i,j,2) .le. gx .and. h(i+1,j,2) .gt. gx &
.and. d(i,j) + w(i+1,j,2) .gt. gx) then
dd = 0.5*h(i+1,j,2)
df = dd
else
u(i,j,2) = 0.0
noflux = 1
end if
if (dd .lt. gx) then
u(i,j,2) = 0.0
noflux = 1
end if
end if
if (noflux .ne. 1) then
xvv = 0.25*(v(i,jv1,1) + v(i+1,jv1,1) + &
v(i,jm1,1) + v(i+1,jm1,1))
uu = u(i,j,1) - rxg*dd*(w(i+1,j,2) - w(i,j,2))
if (hu(i,j,1) .ge. gx .and. &
(i .gt. ispans .and. i .lt. ispane .and. &
j .gt. jspans .and. j .lt. jspane)) then
advx = 0.0
advy = 0.0
if (u(i,j,1) .lt. zero) then
if (hu(ip1,j,1) .lt. gx .or. &
h(ip1,j,2) .lt. gx) then
advx = rx*(-u(i,j,1)**2.0/hu(i,j,1))
else
advx = rx*(u(ip1,j,1)**2.0/hu(ip1,j,1) &
- u(i,j,1)**2.0/hu(i,j,1))
end if
else
if (hu(im1,j,1) .lt. gx .or. &
h(i,j,2) .lt. gx) then
advx = rx*(u(i,j,1)**2.0/hu(i,j,1))
else
advx = rx*(u(i,j,1)**2.0/hu(i,j,1) &
- u(im1,j,1)**2.0/hu(im1,j,1))
end if
end if
if (xvv .lt. zero) then
if (h(i,jp1,2) .lt. gx .or. &
h(ip1,jp1,2) .lt. gx) then
advy = ry*(-u(i,j,1)*xvv/hu(i,j,1))
else if (hu(i,jp1,1) .lt. gx) then
advy = ry*(-u(i,j,1)*xvv/hu(i,j,1))
else
xve = 0.25*(v(i,jp1,1) + v(ip1,jp1,1) &
+ v(i,j,1) + v(ip1,j,1))
advy = ry*(u(i,jp1,1)*xve/hu(i,jp1,1) &
- u(i,j,1)*xvv/hu(i,j,1))
end if
else
if (h(i,jm1,2) .lt. gx .or. &
h(ip1,jm1,2) .lt. gx) then
advy = ry*(u(i,j,1)*xvv/hu(i,j,1))
else if (hu(i,jm1,1) .lt. gx) then
advy = ry*(u(i,j,1)*xvv/hu(i,j,1))
else
jm2 = j-2
if (jm2 .le. 1) jm2 = 1
xve = 0.25*(v(i,jm1,1) + v(ip1,jm1,1) &
+ v(i,jm2,1) + v(ip1,jm2,1))
advy = ry*(u(i,j,1)*xvv/hu(i,j,1) &
- u(i,jm1,1)*xve/hu(i,jm1,1))
end if
end if
C = SQRT(GRAV*H(I,J,2))
DWDT_F = 0.08*C
DWDT_I = 0.65*C
DWDT = ABS(W(I,J,2) - W(I,J,1))/DELT
IF (DWDT .GT. DWDT_F) THEN
REF = (DWDT - DWDT_F)/(DWDT_I - DWDT_F)
A = 1.0
COEF = EXP(-A*REF)
ADVX = COEF*ADVX
ADVY = COEF*ADVY
END IF
uu = uu - advx - advy
end if
if (abs(uu) .lt. eps) uu = 0.0
if (uu .gt. 20.0*dd) uu = 20.0*dd
if (uu .lt. -20.0*dd) uu = -20.0*dd
u(i,j,2) = uu
else
u(i,j,2) = 0.0
end if
end do
end do
!$OMP END DO
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do i = 1, nxw
ip1 = i+1
im1 = i-1
iu1 = i
if (im1 .le. 1) im1 = 1
if (ip1 .ge. nxw) ip1 = nxw
if (iu1 .ge. nxw-1) iu1 = nxw-1
do j = 1, nyw-1
noflux = 0
jp1 = j+1
jm1 = j-1
if (jm1 .le. 1) jm1 = 1
if (jp1 .ge. nyw-1) jp1 = nyw-1
dv = 0.5*(d(i,j) + d(i,j+1))
if (d(i,j) .le. rpmax .or. dv .le. rpmax) then
v(i,j,2) = 0.0
noflux = 1
else
if (h(i,j,2) .gt. gx .and. h(i,j+1,2) .gt. gx) then
dd = hv(i,j,2)
df = hv(i,j,1)
else if (h(i,j,2) .gt. gx .and. h(i,j+1,2) .le. gx &
.and. d(i,j+1) + w(i,j,2) .gt. gx) then
dd = 0.5*h(i,j,2)
df = dd
else if (h(i,j,2) .le. gx .and. h(i,j+1,2) .gt. gx &
.and. d(i,j) + w(i,j+1,2) .gt. gx) then
dd = 0.5*h(i,j+1,2)
df = dd
else
v(i,j,2) = 0.0
noflux = 1
end if
if (dd .lt. gx) then
v(i,j,2) = 0.0
noflux = 1
end if
end if
if (noflux .ne. 1) then
xuu = 0.25*(u(iu1,j,1) + u(iu1,jp1,1) + &
u(im1,j,1) + u(im1,jp1,1))
vv = v(i,j,1) - ryg*dd*(w(i,j+1,2) - w(i,j,2))
if (hv(i,j,1) .ge. gx .and. &
(i .gt. ispans .and. i .lt. ispane .and. &
j .gt. jspans .and. j .lt. jspane)) then
advx = 0.0
advy = 0.0
if (v(i,j,1) .lt. zero) then
if (hv(i,jp1,1) .lt. gx .or. &
h(i,jp1,2) .lt. gx) then
advy = ry*(-v(i,j,1)**2.0/hv(i,j,1))
else
advy = ry*(v(i,jp1,1)**2.0/hv(i,jp1,1) &
- v(i,j,1)**2.0/hv(i,j,1))
end if
else
if (hv(i,jm1,1) .lt. gx .or. &
h(i,j,2) .lt. gx) then
advy = ry*(v(i,j,1)**2.0/hv(i,j,1))
else
advy = ry*(v(i,j,1)**2.0/hv(i,j,1) &
- v(i,jm1,1)**2.0/hv(i,jm1,1))
end if
end if
if (xuu .lt. zero) then
if (h(ip1,j,2) .lt. gx .or. &
h(ip1,jp1,2) .lt. gx) then
advx = rx*(-v(i,j,1)*xuu/hv(i,j,1))
else if (hv(ip1,j,1) .lt. gx) then
advx = rx*(-v(i,j,1)*xuu/hv(i,j,1))
else
xue = 0.25*(u(ip1,j,1) + u(ip1,jp1,1) &
+ u(i,j,1) + u(i,jp1,1))
advx = rx*(v(ip1,j,1)*xue/hv(ip1,j,1) &
- v(i,j,1)*xuu/hv(i,j,1))
end if
else
if (h(im1,j,2) .lt. gx .or. &
h(im1,jp1,2) .lt. gx) then
advx = rx*(v(i,j,1)*xuu/hv(i,j,1))
else if (hv(im1,j,1) .lt. gx) then
advx = rx*(v(i,j,1)*xuu/hv(i,j,1))
else
im2 = i-2
if (im2 .le. 1) im2 = 1
xue = 0.25*(u(im1,j,1) + u(im1,jp1,1) &
+ u(im2,j,1) + u(im2,jp1,1))
advx = rx*(v(i,j,1)*xuu/hv(i,j,1) &
- v(im1,j,1)*xue/hv(im1,j,1))
end if
end if
C = SQRT(GRAV*H(I,J,2))
DWDT_F = 0.08*C
DWDT_I = 0.65*C
DWDT = ABS(W(I,J,2) - W(I,J,1))/DELT
IF (DWDT .GT. DWDT_F) THEN
REF = (DWDT - DWDT_F)/(DWDT_I - DWDT_F)
A = 1.0
COEF = EXP(-A*REF)
ADVX = COEF*ADVX
ADVY = COEF*ADVY
END IF
vv = vv - advx - advy
end if
if (abs(vv) .lt. eps) vv = 0.0
if (vv .gt. 20.0*dd) vv = 20.0*dd
if (vv .lt. -20.0*dd) vv = -20.0*dd
v(i,j,2) = vv
else
v(i,j,2) = 0.0
end if
end do
end do
!$OMP END DO
!$OMP SINGLE
call tuna_openbc
!$OMP END SINGLE
!$OMP SINGLE
call tuna_update
!$OMP END SINGLE
end do
ompend = OMP_GET_WTIME()
!$OMP END PARALLEL
Above code is the computation part (it is a lengthy code). I make sure all the loop counters it, i, j, and dummy loop counters ip1, im1, jp1, jm1, etc are set as private. Also all the dummy variables ww, uu, vv, etc are all set as private. I am not sure where I did wrong. All my variables and constants are declared in a module as shown below.
real, parameter :: gx = 1.0e-5
real, parameter :: eps = 1.0e-10
real, parameter :: zero = 0.0
real, parameter :: rpmax = -20.0
real, parameter :: grav = 9.807
real, dimension(:,:,:), allocatable :: w
real, dimension(:,:,:), allocatable :: h
real, dimension(:,:,:), allocatable :: u
real, dimension(:,:,:), allocatable :: v
real, dimension(:,:,:), allocatable :: hu
real, dimension(:,:,:), allocatable :: hv
real, dimension(:,:), allocatable :: d
integer :: nxw, nyw
real :: delx, dely, delt
Please share if you find something is not right.
AFAICS, your code exposes two oddities:
The variable a is declared shared whereas it looks like it should have been declared private
The two variables ompstart and ompend should probably be initialised outside of the parallel region. Otherwise, all threads will try to update them concurrently.
All other variables look fine to me. In addition, I tried to check for potential dependencies between various indexes of u and v, but didn't spot any. So maybe declaring a private might be enough to fix your code.

I've done the same code on both MATLAB and Python, but ifft2 returns different values?

I've been trying to implement an homomorphic filter in frequency domain on both MATLAB and Python using OpenCV2 and NumPy, the MATLAB code gives the expected answer but the Python does not, the resulting image is very weird. I've tested all variables and came to the conclusion the only point there is a difference is the IFFT. On MATLAB, the results can be applied normally to the exp function and return the filtered original image expected, but the values of Python ifft are very different. I happened to see other posts with similar problems, but no satisfactory answer (perhaps i'm just bad at searching too...).
The MATLAB code
function [ img_r ] = homomorphic( img, D0, n )
[N, M] = size(img);
img_bk = double(img);
img_bk = log(img_bk+1);
img_freq = fftshift(fft2(img_bk));
magA = uint8(10*log(1+abs(img_freq)));
cu = M/2;
cv = N/2;
Hf = zeros(N,M);
for v = 1:N
dv = v - cv;
for u = 1:M
du = u - cu;
D = sqrt(du*du + dv*dv);
num = 1;
if D > 0
den = 1+((D0/D)^(2*n));
else
den = 0; %to replace +inf
end
if den ~= 0
H = num/den;
else
H = 0;
end
img_freq(v,u) = H*img_freq(v,u);
end
end
magB = uint8(10*log(1+abs(img_freq)));
img_r = (ifft2(ifftshift(img_freq)));
img_r = exp(img_r);
img_r = uint8(img_r);
and the Python code (might have some bugs but overall works)
import numpy as np
import cv2
def homomorphic(img, D0, n=2):
[N,M] = img.shape
img_bk = np.log(1 + np.float64(img))
img_freq = np.fft.fftshift(np.fft.fft2(img_bk))
cu = M/2.0
cv = N/2.0
for v in range(N):
dv = v - cv
for u in range(M):
du = u - cu
D = np.sqrt(du*du + dv*dv)
if D != 0:
a = 1.0 + (D0/D)**(2*n)
H = 1/a
else:
print D
H = 0
img_freq[v][u] = H*img_freq[v][u]
img_r = np.abs(np.fft.ifft2(np.fft.ifftshift(img_freq)))
eimg = np.exp(img_r)
eimg = np.uint8(eimg)
return eimg
I really don't get it, why the results are so different? Does anyone have any idea?

More threads added then slower OPENMP code

I tried to parallel a piece of code with OPENMP, but with increasing the number of processors, the code runs slower.!
call OMP_set_num_threads(1)-->16.7sec
call OMP_set_num_threads(4)-->17.7sec
call OMP_set_num_threads(8)-->19sec
System SPEC
Intel Corei7 3610QM 2.3GH up to 3.2GH with 4 cores and 8 threads
///8GB ram DDR3
call OMP_set_num_threads(8)
!$omp parallel
!$omp do private(k,i,j,r,epsilonxx,epsilonyy,epsilonxy,epsilonzz,epsilonxz,&
epsilonyz) reduction(+:dr)
do k=1,niac
i = pair_i(k)
j = pair_j(k)
dx(1) = x(1,j) - x(1,i)
dr = dx(1)*dx(1)
do d=2,dim
dx(d) = x(d,j) - x(d,i)
dr = dr + dx(d)*dx(d)
enddo
r = sqrt(dr)
do d=1,dim
dvx(d) = vx(d,j) - vx(d,i)
enddo
if (dim.eq.3) then
if((abs(itype(i)).gt.1000 .and. abs(itype(j)).gt.1000 ) ) then
epsilonxx = dvx(1)*dwdx(1,k)
epsilonyy = dvx(2)*dwdx(2,k)
epsilonxy = (1/2.)*(dvx(1)*dwdx(2,k)+dvx(2)*dwdx(1,k))
epsilonzz = dvx(dim)*dwdx(dim,k)
epsilonxz = (1/2.)*(dvx(1)*dwdx(dim,k)+dvx(dim)*dwdx(1,k))
epsilonyz = (1/2.)*(dvx(2)*dwdx(dim,k)+dvx(dim)*dwdx(2,k))
epsxx(i) = epsxx(i) + mass(j)*epsilonxx/rho(j)
epsxx(j) = epsxx(j) + mass(i)*epsilonxx/rho(i)
epsyy(i) = epsyy(i) + mass(j)*epsilonyy/rho(j)
epsyy(j) = epsyy(j) + mass(i)*epsilonyy/rho(i)
epsxy(i) = epsxy(i) + mass(j)*epsilonxy/rho(j)
epsxy(j) = epsxy(j) + mass(i)*epsilonxy/rho(i)
epszz(i) = epszz(i) + mass(j)*epsilonzz/rho(j)
epszz(j) = epszz(j) + mass(i)*epsilonzz/rho(i)
epsxz(i) = epsxz(i) + mass(j)*epsilonxz/rho(j)
epsxz(j) = epsxz(j) + mass(i)*epsilonxz/rho(i)
epsyz(i) = epsyz(i) + mass(j)*epsilonyz/rho(j)
epsyz(j) = epsyz(j) + mass(i)*epsilonyz/rho(i)
elseif( (abs(itype(i)).lt.1000 ) .and. (abs(itype(j)).gt.1000 ) ) then
epsilonxx_interface(i) =(2/3.)*(2.e0*dvx(1)*dwdx(1,k)
epsilonxx_interface(j) =dvx(1)*dwdx(1,k)
epsilonyy_interface(i) =(2/3.)*(2.e0*dvx(2)*dwdx(2,k)
epsilonyy_interface(j) =dvx(2)*dwdx(2,k)
epsilonxy_interface(i) =dvx(1)*dwdx(2,k) + dvx(2)*dwdx(1,k)
epsilonxy_interface(j) =(1/2.)*(dvx(1)*dwdx(2,k)+dvx(2)*dwdx(1,k))
epsilonzz_interface(i) =(2/3.)*(2.e0*dvx(dim)*dwdx(dim,k)
epsilonzz_interface(j) =dvx(dim)*dwdx(dim,k) epsilonxz_interface(i) =dvx(1)*dwdx(dim,k) + dvx(dim)*dwdx(1,k)
epsilonxz_interface(j) =(1/2.)*(dvx(1)*dwdx(dim,k)+dvx(dim)*dwdx(1,k))
epsilonyz_interface(i) =dvx(2)*dwdx(dim,k) + dvx(dim)*dwdx(2,k)
epsilonyz_interface(j) =(1/2.)*(dvx(2)*dwdx(dim,k)+dvx(dim)*dwdx(2,k))
epsxx(i) = epsxx(i) + mass(j)*epsilonxx_interface(i)/rho(j)
epsxx(j) = epsxx(j) + mass(i)*epsilonxx_interface(j)/rho(i)
epsyy(i) = epsyy(i) + mass(j)*epsilonyy_interface(i)/rho(j)
epsyy(j) = epsyy(j) + mass(i)*epsilonyy_interface(j)/rho(i)
epsxy(i) = epsxy(i) + mass(j)*epsilonxy_interface(i)/rho(j)
epsxy(j) = epsxy(j) + mass(i)*epsilonxy_interface(j)/rho(i)
epszz(i) = epszz(i) + mass(j)*epsilonzz_interface(i)/rho(j)
epszz(j) = epszz(j) + mass(i)*epsilonzz_interface(j)/rho(i)
epsxz(i) = epsxz(i) + mass(j)*epsilonxz_interface(i)/rho(j)
epsxz(j) = epsxz(j) + mass(i)*epsilonxz_interface(j)/rho(i)
epsyz(i) = epsyz(i) + mass(j)*epsilonyz_interface(i)/rho(j)
epsyz(j) = epsyz(j) + mass(i)*epsilonyz_interface(j)/rho(i)
elseif( (abs(itype(i)).gt.1000 ) .and. (abs(itype(j)).lt.1000 ) ) then
epsilonxx_interface(j) = (2/3.)*(2.e0*dvx(1)*dwdx(1,k)
epsilonxx_interface(i) =dvx(1)*dwdx(1,k)
epsilonyy_interface(j) =(2/3.)*(2.e0*dvx(2)*dwdx(2,k)
epsilonyy_interface(i) = dvx(2)*dwdx(2,k)
epsilonxy_interface(j) =dvx(1)*dwdx(2,k) + dvx(2)*dwdx(1,k)
epsilonxy_interface(i) = (1/2.)*(dvx(1)*dwdx(2,k)+dvx(2)*dwdx(1,k))
epsilonzz_interface(j) = (2/3.)*(2.e0*dvx(dim)*dwdx(dim,k)
epsilonzz_interface(i) =dvx(dim)*dwdx(dim,k)
epsilonxz_interface(j) =dvx(1)*dwdx(dim,k) + dvx(dim)*dwdx(1,k)
epsilonxz_interface(i) =(1/2.)*(dvx(1)*dwdx(dim,k)+dvx(dim)*dwdx(1,k))
epsilonyz_interface(j) =dvx(2)*dwdx(dim,k) + dvx(dim)*dwdx(2,k)
epsilonyz_interface(i) =(1/2.)*(dvx(2)*dwdx(dim,k)+dvx(dim)*dwdx(2,k))
epsxx(i) = epsxx(i) + mass(j)*epsilonxx_interface(i)/rho(j)
epsxx(j) = epsxx(j) + mass(i)*epsilonxx_interface(j)/rho(i)
epsyy(i) = epsyy(i) + mass(j)*epsilonyy_interface(i)/rho(j)
epsyy(j) = epsyy(j) + mass(i)*epsilonyy_interface(j)/rho(i)
epsxy(i) = epsxy(i) + mass(j)*epsilonxy_interface(i)/rho(j)
epsxy(j) = epsxy(j) + mass(i)*epsilonxy_interface(j)/rho(i)
epszz(i) = epszz(i) + mass(j)*epsilonzz_interface(i)/rho(j)
epszz(j) = epszz(j) + mass(i)*epsilonzz_interface(j)/rho(i)
epsxz(i) = epsxz(i) + mass(j)*epsilonxz_interface(i)/rho(j)
epsxz(j) = epsxz(j) + mass(i)*epsilonxz_interface(j)/rho(i)
epsyz(i) = epsyz(i) + mass(j)*epsilonyz_interface(i)/rho(j)
epsyz(j) = epsyz(j) + mass(i)*epsilonyz_interface(j)/rho(i)
endif
endif
enddo
!$omp end do nowait
endif
!$omp end parallel
The performance problem that you observe comes from the very foundation of the algorithm that you use. Each thread picks a pair of particles and computes some values, then modifies the value of eps?? (where ?? is xx, yy, zz, etc.) for both particles. Depending on how the pair list is built, this could lead to many threads trying to modify the values for neighbouring particles or even for the same particle concurrently. In the former case it results in false sharing, which presents itself as huge slowdown due to cache lines being constantly invalidated and reloaded from higher level caches or from main memory. The latter results in completely wrong values for the array elements being computed.
While the latter problem can be easily fixed by either using atomic updates, e.g.
!$OMP ATOMIC UPDATE
epszz(i) = epszz(i) + mass(j)*epsilonzz_interface(i)/rho(j)
or CRITICAL constructs, e.g.
!$OMP CRITICAL
epsxx(i) = epsxx(i) + mass(j)*epsilonxx_interface(i)/rho(j)
epsxx(j) = epsxx(j) + mass(i)*epsilonxx_interface(j)/rho(i)
epsyy(i) = epsyy(i) + mass(j)*epsilonyy_interface(i)/rho(j)
epsyy(j) = epsyy(j) + mass(i)*epsilonyy_interface(j)/rho(i)
epsxy(i) = epsxy(i) + mass(j)*epsilonxy_interface(i)/rho(j)
epsxy(j) = epsxy(j) + mass(i)*epsilonxy_interface(j)/rho(i)
epszz(i) = epszz(i) + mass(j)*epsilonzz_interface(i)/rho(j)
epszz(j) = epszz(j) + mass(i)*epsilonzz_interface(j)/rho(i)
epsxz(i) = epsxz(i) + mass(j)*epsilonxz_interface(i)/rho(j)
epsxz(j) = epsxz(j) + mass(i)*epsilonxz_interface(j)/rho(i)
epsyz(i) = epsyz(i) + mass(j)*epsilonyz_interface(i)/rho(j)
epsyz(j) = epsyz(j) + mass(i)*epsilonyz_interface(j)/rho(i)
!$OMP END CRITICAL
or even array reductions, e.g.
!$OMP PARALLEL REDUCTION(+:epsxx,epsyy,epsxy,epszz,...)
the former problem requires that you change the algorithm. For example you can switch to a different pair list structure, e.g. an array of lists, where the array index is the particle number and each list contains the neighbours of that particle. Sorting the neighbour list will (kind of) reduce the false sharing. Depending on the geometry of the particle distribution, you might end up with severely unbalanced problem, therefore you should think about using dynamic loop scheduling.