Evaluation of Fortran's SUM(..) without temporary array? - fortran

Readability-wise, I find it preferable to write
momentum = sum( [( calculateMomentum(elements(i)), i=1, size(elements,1) )] )
over
momentum = 0.0d0
do i = 1, size(elements,1)
momentum = momentum + calculateMomentum(elements(i))
end do
because the first version has the form of defining the value of momentum, while the second corresponds to a more lower-level accumulation instruction. The difference becomes more pronounced in real-world code with more arguments and possibly multiple indices.
However, the first version allocates a temporary array. As a human programmer I know, that it could be optimized away, so I was wondering if Fortran offers a syntax, that allows calculating the sum with neither an explicit loop nor a temporary array.
Update
No such construct. It looks like there is no syntax, as what I was asking for. See Vladimir F's answer.
It matters, but less than I thought. I've made my own benchmark (pastebin, embedded) using matrix multiplication using several variants.
C(i,j) = C(i,j) + A(i,k) * B(k,j) was the slowest, probably due to the unnecessary array access in each step.
C(i,j) = sum( [(A(i,k) * B(k,j), k = 1, N)] )` was about 10-20% faster than (1), despite the temporary array.
tmp = tmp + A(i,k) * B(k,j), i.e. using a temporary accumulator variable, was about 20% faster than (2).
C = matmul(A,B) was the fastest by far, compared to (3) by a factor of 25 for 500x500 matrices, growing to 50 for 2000x2000, while the relative speed of the other variants stayed roughly the same.
Bottom line: When the task cannot be expressed in optimized library- or intrinsic functions easily, the sum variant has viable performance, and should only be optimized away, if performance really matters to such a degree in that part of the code.

Whether a temporary array will be allocated or net depends on the optimizations in the compiler. Stack allocation is almost free anyway. Copying the values will probably take longer.
The compiler may optimize unnecessary steps away if it can make sure the result will be the same. However, there is no special syntax for that. Fortran typically tries to stay far from the actual implementation and leaves a lot on the compiler.

For experiment, I've tried this code (which computes the sum of inverse of arr).
program main
use iso_fortran_env, only: dp => real64
implicit none
real(dp) val
real(dp), allocatable :: arr(:)
integer num, loop, i, t1, t2, trate
num = 10**8
arr = [( i, i = 1, num )] !! L1
do loop = 1, 10
call system_clock( t1 )
val = sum( [( testfunc( arr(i) ), i = 1, num )] ) !! L2
call system_clock( t2, trate )
print *, "val = ", val, " in ", (t2 - t1) / real(trate), " (s)"
enddo
contains
function testfunc( x ) result( ret )
real(dp), intent(in) :: x
real(dp) :: ret
ret = 1.0_dp / x
end
end program
Then, on my computer (mac2012), "gfortran-10 -O2 test.f90 && time ./a.out" gives
val = 18.997896413852555 in 1.02999997 (s)
val = 18.997896413852555 in 1.10099995 (s)
val = 18.997896413852555 in 1.17600000 (s)
...
real 0m12.575s
user 0m8.142s
sys 0m4.387s
and "gfortran-10 -O3" gives
val = 18.997896413852555 in 0.875000000 (s)
val = 18.997896413852555 in 0.888000011 (s)
val = 18.997896413852555 in 0.833000004 (s)
...
real 0m9.986s
user 0m5.738s
sys 0m4.210s
In both cases, the htop command shows ~1.5 GB allocated, which may be reasonable if lines L1 and L2 use a temporary array (each ~800 MB with ~0.3 s for allocation).
Because there is no syntax for creating "iterators", I've tried making testfunc() to be elemental (or impure elemental). The only difference here is the lines marked with <--.
program main
use iso_fortran_env, only: dp => real64
implicit none
real(dp) val
real(dp), allocatable :: arr(:)
integer num, loop, i, t1, t2, trate
num = 10**8
arr = [( i, i = 1, num )]
do loop = 1, 10
call system_clock( t1 )
val = sum( testfunc( arr ) ) !<--
call system_clock( t2, trate )
print *, "val = ", val, " in ", (t2 - t1) / real(trate), " (s)"
enddo
contains
impure elemental & !<--
function testfunc( x ) result( ret )
real(dp), intent(in) :: x
real(dp) :: ret
ret = 1.0_dp / x
end
end program
Then, "gfortran-10 -O2" gives
val = 18.997896413852555 in 0.437000006 (s)
val = 18.997896413852555 in 0.453999996 (s)
val = 18.997896413852555 in 0.437999994 (s)
...
real 0m5.946s
user 0m5.069s
sys 0m0.842s
and "gfortran-10 -O3" gives
val = 18.997896413852555 in 0.225999996 (s)
val = 18.997896413852555 in 0.252000004 (s)
val = 18.997896413852555 in 0.246999994 (s)
...
real 0m3.909s
user 0m3.009s
sys 0m0.867s
The htop command show ~800 MB, so it seems only arr is allocated.
For comparison, the following code calculates val with an explicit do-loop (using a scalar version of testfunc())
val = 0
do i = 1, num
val = val + testfunc( arr(i) )
enddo
which gave the same timing with the second code with elemental + testfunc(arr) above (on my mac).
In all the above code, the -Ofast option resulted in a runtime error (Illegal instruction). But this was due to the line L1 (arr = [( i, i = 1, num )]). If I allocate arr beforehand and populate it with an explicit loop, -Ofast also worked without problem (giving almost the same timing with -O3 in this case).

If you make calculateMomentum and elemental function, then it can be used for both scalar values and for arrays
for example with:
elemental function calculateMomentum(obj, v) result(p)
class(body), intent(in) :: obj
real, intent(in) :: v
real :: p
p = obj%mass * v
end function
you can apply the above to an array of v
integer::i
type(body) :: ball
real, allocatable :: v(:), p(:)
real :: tot_p
allocate(v(10))
v = [ (10+i, i=1, 10) ]
p = calculateMomentum(ball, v)
tot_p = sum(p)
having an intermediate array to hold the values is advantageous because it keeps the data close by (probably within the cache-line) and the sum() function would be as quick as it can be.
imagine the type body as follows for example
type body
real :: mass
end type

Related

How can I reduce floating-point error in a cubic equation solver?

I'm using the following algorithm to solve a cubic polynomial equation (x^3 + ax^2 + bx + c = 0):
function find_roots(a, b, c, lower_bound, upper_bound)
implicit none
real*8, intent(in) :: a, b, c, lower_bound, upper_bound
real*8 :: find_roots
real*8 :: Q, R, theta, x, Au, Bu
integer :: i, iter
Q = (a**2 - 3.D0*b)/9.D0
R = (2.D0*a**3 - 9.D0*a*b + 27.D0*c)/54.D0
!If roots are all real, get root in range
if (R**2.lt.Q**3) then
iter = 0
theta = acos(R/sqrt(Q**3))
!print *, "theta = ", theta
do i=-1,1
iter = iter+1
x = -2.D0*sqrt(Q)*cos((theta + dble(i)*PI*2.D0)/3.D0)-a/3.D0
!print *, "iter = ", iter, "root = ", x
if ((x.ge.lower_bound).and.(x.le.upper_bound)) then
find_roots = x
return
end if
end do
!Otherwise, two imaginary roots and one real root, return real root
else
Au = -sign(1.D0, R)*(abs(R)+sqrt(R**2-Q**3))**(1.D0/3.D0)
if (Au.eq.0.D0) then
Bu = 0.D0
else
Bu = Q/Au
end if
find_roots = (Au+Bu)-a/3.D0
return
end if
end function find_roots
Now it turns out that it can be shown analytically that a cubic equation with the following inputs:
Q0 = 1.D0
alpha = 1.D-2
dt = 0.00001D0
Y = 1000000.D0
find_roots(-(2.D0*Q0+Y), &
-(alpha-Q0**2-2.D0*Y*Q0+dt/2.D0*alpha), &
(dt/2.D0*alpha*Q0+Y*alpha-Y*Q0**2), &
Q0-sqrt(alpha), &
Q0+sqrt(alpha)))
MUST have a root between Q0+sqrt(alpha) and Q0-sqrt(alpha). This is a mathematical certainty. However, the function as called above will return 0, not the correct root, due to floating-point error, since the required result is very close to Q0+sqrt(alpha). I've confirmed this by creating a new function which uses quadruple precision. Unfortunately, I can't just always use quadruple precision since this function will be called billions of times and is a performance bottleneck.
So my question is, are there any general ways I could re-write this code to reduce these precision errors, while also maintaining the performance? I tried using the algorithm suggested by wikipedia, but the problem actually got worse.
https://www.cliffsnotes.com/study-guides/algebra/algebra-ii/factoring-polynomials/sum-or-difference-of-cubes
This should reduce rounding error.
Likewise, you should be able to find a much better grouping of terms, where you don't make the compiler guess what you want,
https://en.wikipedia.org/wiki/Horner%27s_method
alpha-Q0**2-2.D0*Y*Q0+dt/2.D0*alpha /= (alpha+alpha*.5*dt)-Q0*(Q0+2*Y)
You might argue that any good optimizer should know what to do with .5dt vs. dt/2. ifort considers that a part of -no-prec-div even though it can't change roundoff.
It's up to you whether you choose single precision constants for readability after checking to make sure that the promotion rules cause them to promote exactly to double. It seems particularly bad style to depend on f77 D0 suffix to choose the same data type as the never-standard real*8; no doubt it does if your compiler doesn't complain.
There is something wrong with the accuracy of your calculations, either the calculation of a,b,c or the find_roots function estimates.
I used the a,b,c that are calculated and found that your lower_bound and upper_bound were better estimates of the roots.
I then modified the bounds to be +/- sqrt(alpha)*1.1 so that the range test would work for 64-bit.
I also simplified constants that promote exactly to double.
Finally I compared your estimate of the root to the fn (0.9d0) and fn (1.1d0), which shows the find_roots function does not work for the a,b,c provided.
You should check your references for the error or it may just be the approach fails when acos (+/- 1.0 ) is used.
The program I used to test this with lots of prints is:
real*8 function find_roots (a, b, c, lower_bound, upper_bound)
implicit none
real*8, intent(in) :: a, b, c, lower_bound, upper_bound
real*8 :: Q, R, theta, x, Au, Bu, thi
integer :: i, iter
real*8 :: two_pi ! = 8 * atan (1.0d0)
Q = (a**2 - 3.*b)/9.
R = (2.*a**3 - 9.*a*b + 27.*c)/54.
two_pi = 8 * atan (1.0d0)
!If roots are all real, get root in range
if (R**2 < Q**3) then
iter = 0
x = R/sqrt(Q**3)
theta = acos(x)
print *, "theta = ", theta, x
do i=-1,1
iter = iter+1
!! x = -2.D0*sqrt(Q)* cos((theta + dble(i)*PI*2.D0)/3.D0) - a/3.D0
thi = (theta + i*two_pi)/3.
x = -2.*sqrt(Q) * cos (thi) - a/3.
!print *, "iter = ", iter, "root = ", x
if ( (x >= lower_bound) .and. (x <= upper_bound) ) then
find_roots = x
print *, "find_roots = ", x
! return
end if
end do
!Otherwise, two imaginary roots and one real root, return real root
else
Au = -sign(1.D0, R)*(abs(R)+sqrt(R**2-Q**3))**(1.D0/3.D0)
if (Au.eq.0.D0) then
Bu = 0.D0
else
Bu = Q/Au
end if
find_roots = (Au+Bu)-a/3.D0
return
end if
end function find_roots
real*8 function get_cubic (x, a, b, c)
implicit none
real*8, intent(in) :: x, a, b, c
get_cubic = ( ( x + a) * x + b ) * x + c
end function get_cubic
! Now it turns out that it can be shown analytically that a cubic equation with the following inputs:
real*8 Q0, alpha, dt, Y, a, b, c, lower_bound, upper_bound, val, fn
real*8, external :: find_roots, get_cubic
!
Q0 = 1.D0
alpha = 1.0D-2
dt = 0.00001D0
Y = 1000000.0D0
!
a = -(2.*Q0 + Y)
b = -(alpha - Q0**2 - 2.*Y*Q0 + dt/2.*alpha)
c = (dt/2.*alpha*Q0 + Y*alpha - Y*Q0**2)
write (*,*) a,b,c
!
lower_bound = Q0-sqrt(alpha)*1.1
upper_bound = Q0+sqrt(alpha)*1.1
write (*,*) lower_bound, upper_bound
!
val = find_roots (a, b, c, lower_bound, upper_bound)
!
fn = get_cubic ( val, a,b,c )
write (*,*) val, fn
!
! Test the better root values
val = 0.9d0
fn = get_cubic ( val, a,b,c )
write (*,*) val, fn
!
val = 1.1d0
fn = get_cubic ( val, a,b,c )
write (*,*) val, fn
end

Storing a Variable with a Multi-Dimensional Index in Fortran

Question
Consider the following code:
program example
implicit none
integer, parameter :: n_coeffs = 1000
integer, parameter :: n_indices = 5
integer :: i
real(8), dimension(n_coeffs) :: coeff
integer, dimension(n_coeffs,n_indices) :: index
do i = 1, n_coeffs
coeff(i) = real(i*3,8)
index(i,:) = [2,4,8,16,32]*i
end do
end
For any 5 dimensional index I need to obtain the associated coefficient, without knowing or calculating i. For instance, given [2,4,8,16,32] I need to obtain 3.0 without computing i.
Is there a reasonable solution, perhaps using sparse matrices, that would work for n_indices in the order of 100 (though n_coeffs still in the order of 1000)?
A Bad Solution
One solution would be to define a 5 dimensional array as in
real(8), dimension(2000,4000,8000,16000,32000) :: coeff2
do i = 1, ncoeffs
coeff2(index(i,1),index(i,2),index(i,3),index(i,4),index(i,5)) = coeff(i)
end do
then, to get the coefficient associated with [2,4,8,16,32], call
coeff2(2,4,8,16,32)
However, besides being very wasteful of memory, this solution would not allow n_indices to be set to a number higher than 7 given the limit of 7 dimensions to an array.
OBS: This question is a spin-off of this one. I have tried to ask the question more precisely having failed in the first attempt, an effort that greatly benefited from the answer of #Rodrigo_Rodrigues.
Actual Code
In case it helps here is the code for the actual problem I am trying to solve. It is an adaptive sparse grid method for approximating a function. The main goal is to make the interpolation at the and as fast as possible:
MODULE MOD_PARAMETERS
IMPLICIT NONE
SAVE
INTEGER, PARAMETER :: d = 2 ! number of dimensions
INTEGER, PARAMETER :: L_0 = 4 ! after this adaptive grid kicks in, for L <= L_0 usual sparse grid
INTEGER, PARAMETER :: L_max = 9 ! maximum level
INTEGER, PARAMETER :: bound = 0 ! 0 -> for f = 0 at boundary
! 1 -> adding grid points at boundary
! 2 -> extrapolating close to boundary
INTEGER, PARAMETER :: max_error = 1
INTEGER, PARAMETER :: L2_error = 1
INTEGER, PARAMETER :: testing_sample = 1000000
REAL(8), PARAMETER :: eps = 0.01D0 ! epsilon for adaptive grid
END MODULE MOD_PARAMETERS
PROGRAM MAIN
USE MOD_PARAMETERS
IMPLICIT NONE
INTEGER, DIMENSION(d,d) :: ident
REAL(8), DIMENSION(d) :: xd
INTEGER, DIMENSION(2*d) :: temp
INTEGER, DIMENSION(:,:), ALLOCATABLE :: grid_index, temp_grid_index, grid_index_new, J_index
REAL(8), DIMENSION(:), ALLOCATABLE :: coeff, temp_coeff, J_coeff
REAL(8) :: temp_min, temp_max, V, T, B, F, x1
INTEGER :: k, k_1, k_2, h, i, j, L, n, dd, L1, L2, dsize, count, first, repeated, add, ind
INTEGER :: time1, time2, clock_rate, clock_max
REAL(8), DIMENSION(L_max,L_max,2**(L_max),2**(L_max)) :: coeff_grid
INTEGER, DIMENSION(d) :: level, LL, ii
REAL(8), DIMENSION(testing_sample,d) :: x_rand
REAL(8), DIMENSION(testing_sample) :: interp1, interp2
! ============================================================================
! EXECUTABLE
! ============================================================================
ident = 0
DO i = 1,d
ident(i,i) = 1
ENDDO
! Initial grid point
dsize = 1
ALLOCATE(grid_index(dsize,2*d),grid_index_new(dsize,2*d))
grid_index(1,:) = 1
grid_index_new = grid_index
ALLOCATE(coeff(dsize))
xd = (/ 0.5D0, 0.5D0 /)
CALL FF(xd,coeff(1))
CALL FF(xd,coeff_grid(1,1,1,1))
L = 1
n = SIZE(grid_index_new,1)
ALLOCATE(J_index(n*2*d,2*d))
ALLOCATE(J_coeff(n*2*d))
CALL SYSTEM_CLOCK (time1,clock_rate,clock_max)
DO WHILE (L .LT. L_max)
L = L+1
n = SIZE(grid_index_new,1)
count = 0
first = 1
DEALLOCATE(J_index,J_coeff)
ALLOCATE(J_index(n*2*d,2*d))
ALLOCATE(J_coeff(n*2*d))
J_index = 0
J_coeff = 0.0D0
DO k = 1,n
DO i = 1,d
DO j = 1,2
IF ((bound .EQ. 0) .OR. (bound .EQ. 2)) THEN
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(grid_index_new(k,d+i)-(-1)**j)/)
ELSEIF (bound .EQ. 1) THEN
IF (grid_index_new(k,i) .EQ. 1) THEN
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(-(-1)**j)/)
ELSE
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(grid_index_new(k,d+i)-(-1)**j)/)
ENDIF
ENDIF
CALL XX(d,temp(1:d),temp(d+1:2*d),xd)
temp_min = MINVAL(xd)
temp_max = MAXVAL(xd)
IF ((temp_min .GE. 0.0D0) .AND. (temp_max .LE. 1.0D0)) THEN
IF (first .EQ. 1) THEN
first = 0
count = count+1
J_index(count,:) = temp
V = 0.0D0
DO k_1 = 1,SIZE(grid_index,1)
T = 1.0D0
DO k_2 = 1,d
CALL XX(1,temp(k_2),temp(d+k_2),x1)
CALL BASE(x1,grid_index(k_1,k_2),grid_index(k_1,k_2+d),B)
T = T*B
ENDDO
V = V+coeff(k_1)*T
ENDDO
CALL FF(xd,F)
J_coeff(count) = F-V
ELSE
repeated = 0
DO h = 1,count
IF (SUM(ABS(J_index(h,:)-temp)) .EQ. 0) THEN
repeated = 1
ENDIF
ENDDO
IF (repeated .EQ. 0) THEN
count = count+1
J_index(count,:) = temp
V = 0.0D0
DO k_1 = 1,SIZE(grid_index,1)
T = 1.0D0
DO k_2 = 1,d
CALL XX(1,temp(k_2),temp(d+k_2),x1)
CALL BASE(x1,grid_index(k_1,k_2),grid_index(k_1,k_2+d),B)
T = T*B
ENDDO
V = V+coeff(k_1)*T
ENDDO
CALL FF(xd,F)
J_coeff(count) = F-V
ENDIF
ENDIF
ENDIF
ENDDO
ENDDO
ENDDO
ALLOCATE(temp_grid_index(dsize,2*d))
ALLOCATE(temp_coeff(dsize))
temp_grid_index = grid_index
temp_coeff = coeff
DEALLOCATE(grid_index,coeff)
ALLOCATE(grid_index(dsize+count,2*d))
ALLOCATE(coeff(dsize+count))
grid_index(1:dsize,:) = temp_grid_index
coeff(1:dsize) = temp_coeff
DEALLOCATE(temp_grid_index,temp_coeff)
grid_index(dsize+1:dsize+count,:) = J_index(1:count,:)
coeff(dsize+1:dsize+count) = J_coeff(1:count)
dsize = dsize + count
DO i = 1,count
coeff_grid(J_index(i,1),J_index(i,2),J_index(i,3),J_index(i,4)) = J_coeff(i)
ENDDO
IF (L .LE. L_0) THEN
DEALLOCATE(grid_index_new)
ALLOCATE(grid_index_new(count,2*d))
grid_index_new = J_index(1:count,:)
ELSE
add = 0
DO h = 1,count
IF (ABS(J_coeff(h)) .GT. eps) THEN
add = add + 1
J_index(add,:) = J_index(h,:)
ENDIF
ENDDO
DEALLOCATE(grid_index_new)
ALLOCATE(grid_index_new(add,2*d))
grid_index_new = J_index(1:add,:)
ENDIF
ENDDO
CALL SYSTEM_CLOCK (time2,clock_rate,clock_max)
PRINT *, 'Elapsed real time1 = ', DBLE(time2-time1)/DBLE(clock_rate)
PRINT *, 'Grid Points = ', SIZE(grid_index,1)
! ============================================================================
! Compute interpolated values:
! ============================================================================
CALL RANDOM_NUMBER(x_rand)
CALL SYSTEM_CLOCK (time1,clock_rate,clock_max)
DO i = 1,testing_sample
V = 0.0D0
DO L1=1,L_max
DO L2=1,L_max
IF (L1+L2 .LE. L_max+1) THEN
level = (/L1,L2/)
T = 1.0D0
DO dd = 1,d
T = T*(1.0D0-ABS(x_rand(i,dd)/2.0D0**(-DBLE(level(dd)))-DBLE(2*FLOOR(x_rand(i,dd)*2.0D0**DBLE(level(dd)-1))+1)))
ENDDO
V = V + coeff_grid(L1,L2,2*FLOOR(x_rand(i,1)*2.0D0**DBLE(L1-1))+1,2*FLOOR(x_rand(i,2)*2.0D0**DBLE(L2-1))+1)*T
ENDIF
ENDDO
ENDDO
interp2(i) = V
ENDDO
CALL SYSTEM_CLOCK (time2,clock_rate,clock_max)
PRINT *, 'Elapsed real time2 = ', DBLE(time2-time1)/DBLE(clock_rate)
END PROGRAM
For any 5 dimensional index I need to obtain the associated
coefficient, without knowing or calculating i. For instance, given
[2,4,8,16,32] I need to obtain 3.0 without computing i.
function findloc_vector(matrix, vector) result(out)
integer, intent(in) :: matrix(:, :)
integer, intent(in) :: vector(size(matrix, dim=2))
integer :: out, i
do i = 1, size(matrix, dim=1)
if (all(matrix(i, :) == vector)) then
out = i
return
end if
end do
stop "No match for this vector"
end
And that's how you use it:
print*, coeff(findloc_vector(index, [2,4,8,16,32])) ! outputs 3.0
I must confess I was reluctant to post this code because, even though this answers your question, I honestly think this is not what you really want/need, but you dind't provide enough information for me to know what you really do want/need.
Edit (After actual code from OP):
If I decrypted your code correctly (and considering what you said in your previous question), you are declaring:
REAL(8), DIMENSION(L_max,L_max,2**(L_max),2**(L_max)) :: coeff_grid
(where L_max = 9, so size(coeff_grid) = 21233664 =~160MB) and then populating it with:
DO i = 1,count
coeff_grid(J_index(i,1),J_index(i,2),J_index(i,3),J_index(i,4)) = J_coeff(i)
ENDDO
(where count is of the order of 1000, i.e. 0.005% of its elements), so this way you can fetch the values by its 4 indices with the array notation.
Please, don't do that. You don't need a sparse matrix in this case either. The new approach you proposed is much better: storing the indices in each row of an smaller array, and fetching on the array of coefficients by the corresponding location of those indices in its own array. This is way faster (avoiding the large allocation) and much more memory-efficient.
PS: Is it mandatory for you to stick to Fortran 90? Its a very old version of the standard and chances are that the compiler you're using implements a more recent version. You could improve the quality of your code a lot with the intrinsic move_alloc (for less array copies), the kind constants from the intrinsic module iso_fortran_env (for portability), the [], >, <, <=,... notation (for readability)...

What is wrong with my sub-function definition of sinc?

implicit none
character*20 fflname,oflname
integer length_sgnl
real*8 pi, dt, m, n, theta
parameter ( length_sgnl=11900, dt=0.01d0, m=1, n=1, pi=3.1416
& ,theta=0.2 )
integer i
complex*16 cj, coeff ,sgnl(1 : length_sgnl)
real*8 t(1 : length_sgnl)
parameter ( cj = dcmplx(0, 1) )
real*8 time, real_sgnl, imag_sgnl
oflname="filtered.data"
fflname="artificial"
open(11, file = oflname)
do i=1, length_sgnl
read(11, *) time, real_sgnl, imag_sgnl
sgnl(i) = dcmplx(real_sgnl, imag_sgnl)
t(i) = (i*dt - m) / (2**n)
enddo
coeff = 0
do i=1, length_sgnl
coeff = coeff
& + sgnl(i) * sinc (t(i)) * exp (-cj*2*pi*t(i))
enddo
do i=1, length_sgnl
sgnl(i) = sgnl(i)
& - coeff * sinc (t(i)) * exp (-cj*2*pi*t(i))
& + coeff * sinc (t(i)) * exp (-cj*2*pi*t(i))
& * exp (cj*theta)
enddo
open(12, file = fflname)
do i=1, length_sgnl
write(12, *) i*dt, sgnl(i)
enddo
close(12)
real*8 function sinc (a)
real*8 :: sinc, a
if (abs(a) < 1.0d-6) then
sinc = 1
else
sinc = sin(pi*a) / (pi*a)
end if
end function
stop
end
At the last part of a sub-defined function sinc, I assume the problem is there but I am not sure what it is exactly. The gfortran noticed that I did not define sinc and a, and the "end function" should be "end program"?
I have tried to update your program into standards-compliant modern Fortran:
program sinctest
use :: iso_fortran_env
implicit none
! Declare parameters
integer, parameter :: length_sgnl=11900
real(real64), parameter :: pi=3.1416, dt=0.01, m=1, n=1, theta=0.2
complex(real64), parameter :: cj = cmplx(0, 1)
! Declare variables
character(len=20) :: fflname, oflname
complex(real64) :: coeff, sgnl(length_sgnl)
real(real64) :: time, real_sgnl, imag_sgnl, t(length_sgnl)
integer :: i, ofl, ffl
! Define filenames
oflname="filtered.data"
fflname="artificial"
! Read the input file
open(newunit = ofl, file = oflname)
do i=1, length_sgnl
read(ofl, *) time, real_sgnl, imag_sgnl
sgnl(i) = cmplx(real_sgnl, imag_sgnl, kind=real64)
t(i) = (i*dt - m) / (2**n)
end do
close(ofl)
! Process the input signal
coeff = 0
do i=1, length_sgnl
coeff = coeff &
+ sgnl(i) * sinc(t(i)) * exp(-cj*2*pi*t(i))
end do
do i=1, length_sgnl
sgnl(i) = sgnl(i) &
- coeff * sinc(t(i)) * exp(-cj*2*pi*t(i)) &
+ coeff * sinc(t(i)) * exp(-cj*2*pi*t(i)) &
* exp(cj*theta)
end do
! Save the output file
open(newunit = ffl, file = fflname)
do i=1, length_sgnl
write(ffl, *) i*dt, sgnl(i)
enddo
close(ffl)
contains
pure function sinc(a) result(r)
! This function calculates sinc(a)=sin(pi*a)/(pi*a).
real(real64), intent(in) :: a
real(real64) :: r
if (abs(a) < 1.0e-6) then
r = 1
else
r = sin(pi*a) / (pi*a)
end if
end function
end program
To compile it using e.g. GFortran:
gfortran -std=f2008 -ffree-form sinctest.f
These are the syntax errors that I fixed:
Added a contains section before defining your sinc-function;
Moved your continuation characters (&) from the beginning of a continued line to the end of the previous line;
These are not required changes, just merely style suggestions:
Used the intrinsic module iso_fortran_env to get the real64 variable, which lets you define variables as real(real64) instead of real*8, as the former is portable while the latter is not;
Merged the specification of the variable type (e.g. real) and parameter into a single lines;
Used the Fortran2008 newunit argument to open instead of hard-coding in unit numbers, as this saves you some headache if you write large programs and have a modern compiler;
Made sure that you close the input file as well;
Declared your sinc-function to be pure, as it has no side-effects;
Used the result notation for your sinc-function, so that you don't have to specify the type real*8 in front of the function name;
Rewrote the program in the form program...end program instead of ...stop end.
EDIT:
I also wanted to note that using modern Fortran, the math itself can be written considerably more consise using 'array notation' and 'elemental functions'. For instance, if you define your sinc-function:
elemental function sinc(a) result(r)
! This function calculates sinc(a)=sin(pi*a)/(pi*a).
real(real64), intent(in) :: a
real(real64) :: r
if (abs(a) < 1.0e-6) then
r = 1
else
r = sin(pi*a) / (pi*a)
end if
end function
Then the elemental keyword says that if you apply the sinc-function to an array, it should return a new array where the sinc-function has been evaluated for each element. So this piece of code:
coeff = 0
do i=1, length_sgnl
coeff = coeff &
+ sgnl(i) * sinc(t(i)) * exp(-cj*2*pi*t(i))
end do
Can then actually be written as a one-liner:
coeff = sum(sgnl * sinc(t) * exp(-2*pi*cj*t))
So I would highly recommend that you look into the modern array notation too :).
EDIT 2:
Tried to emphasize what changes are relevant to fixing errors, and what changes are just style suggestions (thanks Vladimir F).

Wrong result when running code in parallel

The gfortran compiler gives wrong answer, when I run a parallel program using OpenMP. In the same time, ifort gives exact result.
This is the whole compilable code.
!_______________________________________________________________ !
!____________MODULE SECTION_____________________________________ !
MODULE MATRIC
IMPLICIT NONE
INTEGER , PARAMETER :: NG = 40
DOUBLE PRECISION,SAVE :: Z , PA , PB ,CMU
DOUBLE PRECISION , PARAMETER :: PI=2.0D0*ACOS(0.0D0) , &
FPI=4.0D0*PI , SQFPI = SQRT(FPI), DLAM=1.0D0
DOUBLE PRECISION , DIMENSION(450) :: DEL1, DEL2, X, R , SNLO
DOUBLE PRECISION :: XG(60) , WG(60)
END MODULE MATRIC
!_________________________________________________________________________!
! MODULE SECTION
!__________________________________________________________________________!
MODULE POTDATA
IMPLICIT NONE
INTEGER :: IA , IB , ID
DOUBLE PRECISION :: RA , RB , R1s(450)
END MODULE POTDATA
!__________________________________________________________________________!
!______________________________________________________________________!
program check
use matric
use potdata
implicit double precision(a-h,o-z)
pa = 0.72D0 ; pb = 0.19D0
mesh = 441 ; noint= 40 ; z = 2.0d0
CALL GAULEG(-1.d0,1.d0)
NB = MESH/NOINT
I = 1
X(I) = 0.0D+00
DELTAX = 0.0025D+00*40.0D+00/DBLE(NOINT)
DO J=1,NB
IMK = (J-1)*NOINT + 1
DO K=1,NOINT
AK=K
I=I+1
X(I)=X(IMK)+AK*DELTAX
END DO
DELTAX=2.0D+00*DELTAX
END DO
CMU=(9.0D00*PI*PI/(128.0D00*Z))**THIRD
R(1)=0.0D+00 ; SNLO(1) = 0.D00
DO I=2,MESH
R(I)=CMU*X(I)
SNLO(I) = R(I)*dexp(-Z*R(I))
R1S(I) = SNLO(I)/(SQFPI*R(I))
END DO
call EFFPOT(MESH,NOINT)
end program check
subroutine EFFPOT(MESH,NOINT)
USE OMP_LIB
USE MATRIC
USE POTDATA
implicit none
integer, intent(in) :: MESH, NOINT
double precision::anorm(450)
double precision, external :: funct
double precision :: asum, fac, cnorm
!$omp parallel do default(none) private(del1,ia,asum,ib,ra,rb,fac) &
!$omp shared(id,mesh,r,anorm,NOINT,del2,R1s)
do ia = 2,mesh
ra = r(ia)
if(R1s(ia).lt.1.D-7.and.R1s(ia).ge.1.D-8)id = ia
do ib = 2,mesh
rb = r(ib)
call QGAUSS(funct,-1.d0,1.d0,fac)
del1(ib) = rb**2*fac*R1s(ib)**2
end do
CALL NCDF(del1,ASUM,r(2),mesh,NOINT)
anorm(ia) = 2.0d0*pi*asum
del2(ia) = 2.0d0*pi*asum*(ra*R1s(ia))**2
end do
!$omp end parallel do
CALL NCDF(del2,ASUM,r(2),mesh,NOINT)
cnorm = 1.0/dsqrt(4.*pi*ASUM)
write(6,*)'cnorm =',cnorm
return
end
double precision function funct(x)
USE POTDATA , ONLY : RA , RB
USE MATRIC , ONLY : PA , PB , DLAM
implicit none
double precision, intent(in) :: x
double precision :: f1, f2, ramrb
ramrb = dsqrt(ra**2+rb**2-2.d0*ra*rb*x)
f1 = dcosh(pa*ra)+dcosh(pa*rb)
f2 = 1.d0+0.5*dlam*ramrb*dexp(-pb*ramrb)
funct = (f1*f2)**2
return
end
SUBROUTINE QGAUSS(func,aa,bb,ss)
USE OMP_LIB
USE MATRIC , ONLY : XG ,WG , NG
IMPLICIT DOUBLE PRECISION(A-H,O-Z)
external func
xm = 0.5d0*(bb+aa)
xl = 0.5d0*(bb-aa)
ss = 0.d0
do j=1,ng
dx = xl*xg(j)
ss = ss + wg(j)*(func(xm+dx)+func(xm-dx))
end do
ss = xl*ss/2.0
return
END
SUBROUTINE GAULEG(x1,x2)
USE MATRIC , ONLY : XG ,WG ,NG , PI
IMPLICIT DOUBLE PRECISION(A-H,O-Z)
eps = 1.d-14
m = (ng+1)/2
xm = 0.5D0*(x1+x2)
xl = 0.5D0*(x2-x1)
do i=1,m
z = dcos(pi*(dfloat(i)-0.25d0)/(dfloat(ng)+0.5d0))
1 continue
p1 = 1.d0
p2 = 0.d0
do j=1,ng
p3 = p2
p2 = p1
p1 = ((2.d0*dfloat(j)-1.d0)*z*p2 &
- (dfloat(j)-1.d0)*p3)/dfloat(j)
end do
pp = dfloat(ng)*(z*p1-p2)/(z*z-1.d0)
z1 = z
z = z1 - p1/pp
if (dabs(z-z1).gt.eps) go to 1
xg(i) = xm - xl*z
xg(ng+1-i) = xm + xl*z
wg(i) = 2.d0*xl/((1.d0-z*z)*pp*pp)
wg(ng+1-i) = wg(i)
end do
return
end
SUBROUTINE NCDF(F,ASUM,H,KKK,NOINT)
IMPLICIT DOUBLE PRECISION (A-H,O-Z)
DIMENSION F(450)
NBLOCK = (KKK-2)/NOINT + 1
C2HO45 = 2.0D+00*H/45.0D+00
ASUM = 0.0D+00
DO J=1,NBLOCK
ISTAR = NOINT*(J-1)+5
IEND = NOINT*J + 1
IEND = MIN0(KKK,IEND)
DO I=ISTAR,IEND,4
ASUM = ASUM + C2HO45*(7.0D+00*(F(I-4)+F(I)) &
+32.0D+00*(F(I-3)+F(I-1)) + 12.0D+00*F(I-2))
END DO
IF(IEND.EQ.KKK) GO TO 4
C2HO45 = 2.0D+00*C2HO45
4 END DO
RETURN
END
Thanks everybody specially #Vladimir who has taken interest in my problem. Finally i got the right answer by removing ra and rb from the module potdata and defined function as funct(x, ra, rb) and then removing ra and rb from the loop. Because i was writing ra, rb then reading their values in the above code so loop was having flow dependence. Now i get exact result from both compiler (which is 8.7933767516) parallelly, sequentially both. Exact way is this
subroutine EFFPOT(MESH,NOINT)
USE OMP_LIB
USE MATRIC
USE POTDATA
implicit none
integer, intent(in) :: MESH, NOINT
double precision::anorm(450)
double precision, external :: funct
double precision :: asum, fac, cnorm
!$omp parallel do default(none) private(del1,ia,asum,ib,fac) &
!$omp shared(id,mesh,r,anorm,NOINT,del2,R1s)
do ia = 2,mesh
if(R1s(ia).lt.1.D-7.and.R1s(ia).ge.1.D-8)id = ia
do ib = 2,mesh
call QGAUSS(funct,-1.d0,1.d0,fac,r(ia),r(ib))
del1(ib) = r(ib)**2*fac*R1s(ib)**2
end do
CALL NCDF(del1,ASUM,r(2),mesh,NOINT)
anorm(ia) = 2.0d0*pi*asum
del2(ia) = 2.0d0*pi*asum*(r(ia)*R1s(ia))**2
end do
!$omp end parallel do
CALL NCDF(del2,ASUM,r(2),mesh,NOINT)
cnorm = 1.0/dsqrt(4.*pi*ASUM)
write(6,*)'cnorm =',cnorm
return
end
double precision function funct(x,ra,rb)
USE MATRIC , ONLY : PA , PB , DLAM
implicit none
double precision, intent(in) :: x, ra, rb
double precision :: f1, f2, ramrb
ramrb = dsqrt(ra**2+rb**2-2.d0*ra*rb*x)
f1 = dcosh(pa*ra)+dcosh(pa*rb)
f2 = 1.d0+0.5*dlam*ramrb*dexp(-pb*ramrb)
funct = (f1*f2)**2
return
end
SUBROUTINE QGAUSS(func,aa,bb,ss,ra,rb)
USE OMP_LIB
USE MATRIC , ONLY : XG ,WG , NG
IMPLICIT DOUBLE PRECISION(A-H,O-Z)
external func
xm = 0.5d0*(bb+aa)
xl = 0.5d0*(bb-aa)
ss = 0.d0
do j=1,ng
dx = xl*xg(j)
ss = ss + wg(j)*(func(xm+dx,ra,rb)+func(xm-dx,ra,rb))
end do
ss = xl*ss/2.0
return
END
The cause of your problem is that the OpenMP standard does not specify what happens if a private list item is accessed in the region but outside of the construct. See example private.2f (found on page 135 of the OpenMP standard supplement) for a short version of the same problem.
Specifically, the module variables ra and rb are declared private in the OpenMP parallel region inside EFFPOT and also accessed from funct. funct is in the dynamic scope of the parallel region but (lexically) outside of it and therefore it is not specified whether ra and rb referenced by funct are the original module variables or their private copies (most compilers would go for the original variables).
You have already found one of the solutions. The other one would be to declare ra and rb threadprivate since they are only used to pass data from EFFPOT to funct:
MODULE POTDATA
IMPLICIT NONE
INTEGER :: IA , IB , ID
DOUBLE PRECISION :: RA , RB , R1s(450)
!$OMP THREADPRIVATE(RA,RB)
END MODULE POTDATA
You should then also remove ra and rb from the list of the private clause of the parallel region within EFFPOT.
On some platforms, e.g. OS X, using threadprivate with GCC (i.e. gfortran) could be slower than actually passing around the two variables as arguments because of the emulated TLS.
Note that this semantic error is really hard to detect and many OpenMP tools won't actually spot it.
First of all, it is very difficult to say something specific without seeing the actual code. However, I do have some comments on your situation and the conclusions you are drawing.
The fact that your program runs fine both in parallel and sequential execution when compiled with "ifort" doesn't mean that your program is correct. Since compiler bugs leading to programs giving wrong answers are very rare, but on the other hand manual parallel programming is very error-prone, we should assume a problem with the way you parallelized your code. We are probably talking about a race condition.
And no, the problem you are having is not at all unusual. When you have a race condition, it happens often that the sequential execution works everywhere and your parallel execution works in some environments and fails in others. It's even common that your code gives different answers every time you call it (not only depending on the compiler, but on many other factors that can change over time).
What I suggest you should do, is to get a parallel debugger, like for example TotalView that will help you keep track of the various threads and their states. Try to find a simple test environment (as few threads as possible) that fails reliably.

LAPACK: Are operations on packed storage matrices faster?

I want to tridiagonalize a real symmetric matrix using Fortran and LAPACK. LAPACK basically provides two routines, one operating on the full matrix, the other on the matrix in packed storage. While the latter surely uses less memory, I was wondering if anything can be said about the speed difference?
It's an empirical question, of course: but in general, nothing comes for free, and less memory/more runtime is a pretty common tradeoff.
In this case, the indexing for the data is more complex for the packed case, so as you traverse the matrix, the cost of getting your data is a little higher. (Complicating this picture is that for symmetric matrices, the lapack routines also assume a certain kind of packing - that you only have the upper or lower component of the matrix available).
I was messing around with an eigenproblem earlier today, so I'll use that as a measurement benchmark; trying with a simple symmetric test case (The Herdon matrix, from http://people.sc.fsu.edu/~jburkardt/m_src/test_mat/test_mat.html ), and comparing ssyevd with sspevd
$ ./eigen2 500
Generating a Herdon matrix:
Unpacked array:
Eigenvalues L_infty err = 1.7881393E-06
Packed array:
Eigenvalues L_infty err = 3.0994415E-06
Packed time: 2.800000086426735E-002
Unpacked time: 2.500000037252903E-002
$ ./eigen2 1000
Generating a Herdon matrix:
Unpacked array:
Eigenvalues L_infty err = 4.5299530E-06
Packed array:
Eigenvalues L_infty err = 5.8412552E-06
Packed time: 0.193900004029274
Unpacked time: 0.165000006556511
$ ./eigen2 2500
Generating a Herdon matrix:
Unpacked array:
Eigenvalues L_infty err = 6.1988831E-06
Packed array:
Eigenvalues L_infty err = 8.4638596E-06
Packed time: 3.21040010452271
Unpacked time: 2.70149993896484
There's about an 18% difference, which I must admit is larger than I expected (also with a slightly larger error for the packed case?). This is with intel's MKL. The performance difference will depend on your matrix in general, of course, as eriktous points out, and on the problem you're doing; the more random access to the matrix you have to do, the worse the overhead would be. The code I used is as follows:
program eigens
implicit none
integer :: nargs,n ! problem size
real, dimension(:,:), allocatable :: A, B, Z
real, dimension(:), allocatable :: PA
real, dimension(:), allocatable :: work
integer, dimension(:), allocatable :: iwork
real, dimension(:), allocatable :: eigenvals, expected
real :: c, p
integer :: worksize, iworksize
character(len=100) :: nstr
integer :: unpackedclock, packedclock
double precision :: unpackedtime, packedtime
integer :: i,j,info
! get filename
nargs = command_argument_count()
if (nargs /= 1) then
print *,'Usage: eigen2 n'
print *,' Where n = size of array'
stop
endif
call get_command_argument(1, nstr)
read(nstr,'(I)') n
if (n < 4 .or. n > 25000) then
print *, 'Invalid n ', nstr
stop
endif
! Initialize local arrays
allocate(A(n,n),B(n,n))
allocate(eigenvals(n))
! calculate the matrix - unpacked
print *, 'Generating a Herdon matrix: '
A = 0.
c = (1.*n * (1.*n + 1.) * (2.*n - 5.))/6.
forall (i=1:n-1,j=1:n-1)
A(i,j) = -1.*i*j/c
endforall
forall (i=1:n-1)
A(i,i) = (c - 1.*i*i)/c
A(i,n) = 1.*i/c
endforall
forall (j=1:n-1)
A(n,j) = 1.*j/c
endforall
A(n,n) = -1./c
B = A
! expected eigenvalues
allocate(expected(n))
p = 3. + sqrt((4. * n - 3.) * (n - 1.)*3./(n+1.))
expected(1) = p/(n*(5.-2.*n))
expected(2) = 6./(p*(n+1.))
expected(3:n) = 1.
print *, 'Unpacked array:'
allocate(work(1),iwork(1))
call ssyevd('N','U',n,A,n,eigenvals,work,-1,iwork,-1,info)
worksize = int(work(1))
iworksize = int(work(1))
deallocate(work,iwork)
allocate(work(worksize),iwork(iworksize))
call tick(unpackedclock)
call ssyevd('N','U',n,A,n,eigenvals,work,worksize,iwork,iworksize,info)
unpackedtime = tock(unpackedclock)
deallocate(work,iwork)
if (info /= 0) then
print *, 'Error -- info = ', info
endif
print *,'Eigenvalues L_infty err = ', maxval(eigenvals-expected)
! pack array
print *, 'Packed array:'
allocate(PA(n*(n+1)/2))
allocate(Z(n,n))
do i=1,n
do j=i,n
PA(i+(j-1)*j/2) = B(i,j)
enddo
enddo
allocate(work(1),iwork(1))
call sspevd('N','U',n,PA,eigenvals,Z,n,work,-1,iwork,-1,info)
worksize = int(work(1))
iworksize = iwork(1)
deallocate(work,iwork)
allocate(work(worksize),iwork(iworksize))
call tick(packedclock)
call sspevd('N','U',n,PA,eigenvals,Z,n,work,worksize,iwork,iworksize,info)
packedtime = tock(packedclock)
deallocate(work,iwork)
deallocate(Z,A,B,PA)
if (info /= 0) then
print *, 'Error -- info = ', info
endif
print *,'Eigenvalues L_infty err = ', &
maxval(eigenvals-expected)
deallocate(eigenvals, expected)
print *,'Packed time: ', packedtime
print *,'Unpacked time: ', unpackedtime
contains
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
end program eigens