Can I speed this up any more? - fortran

I am interested in speeding up computation time for the subroutine compoundret which basically compounds a monthly return series over some holding period, say one month, three months, six months, etc.
I will be calling this subroutine from R as a dll. I have written a main function in the attached code snippet to get everything working in fortran.
subroutine compoundret(R_c, R, RF, horizons, Tn, N, M)
implicit none
! Arguments declarations
integer, intent(in) :: horizons(M), Tn, N, M
real*8, intent(in) :: RF(Tn), R(Tn, N, M)
real*8, intent(out) :: R_c(Tn, N, M)
! Intermediary Variables
integer :: t, j, k
real*8 :: RF_Temp(Tn, N, M)
R_c = 0.0
do t = 1, Tn
RF_Temp(t,:,:) = RF(t)
end do
!$acc data copyin(r(Tn,N,M), RF_Temp(Tn,N,M), horizons(M)), create(R_c(Tn,
N, M))
!$acc parallel loop
do k = 1, M
do j = 1, N
do t = 1, Tn - horizons(k) + 1
R_c( t, j, k) = PRODUCT( 1 + R( t:t + horizons(k) - 1, j, k) + &
RF_Temp( t:t + horizons(k) - 1, j, k)) - &
PRODUCT(1+ RF_Temp( t:t + horizons(k) - 1, j, k))
end do
end do
end do
!$acc end parallel
!$acc update host(R_c)
!$acc end data
end subroutine compoundret
Program main
implicit none
real*8 :: df(1000,5000, 6)
real*8 :: retdata(size(df,1),size(df,2),size(df,3)),RF(size(df,1))
integer :: horizons(6), Tn, N, M
Tn = size(df, 1)
N = size(df, 2)
M = size(df, 3)
df = 0.001
RF = 0.001
horizons(:) = (/1,3,6,12,24,48/)
call compoundret(retdata,df,RF,horizons, Tn, N, M)
print*, retdata(1, 1, 1:M)
end program
My target platform is a compute 6.0 device (GTX 1060).

I'd recommend collapsing the two outer loops and then adding "!$acc loop vector" on the "t" loop.
!$acc parallel loop collapse(2)
do k = 1, M
do j = 1, N
!$acc loop vector
do t = 1, Tn - horizons(k) + 1
R_c( t, j, k) = PRODUCT( 1 + R( t:t + horizons(k) - 1, j, k) + &
RF_Temp( t:t + horizons(k) - 1, j, k)) - &
PRODUCT(1+ RF_Temp( t:t + horizons(k) - 1, j, k))
end do
end do
end do
!$acc end parallel
Right now, you're only parallelizing the outer loop and since "M" is quite small, you're underutilizing the GPU.
Note that the PGI 2017 compilers have a bug which will prevent you from using OpenACC within a DLL (shared objects on Linux are fine). We're working on fixing this issue in the 18.1 compilers. You're current options are to either wait till 18.1 is released early next year or go back to the 16.10 compilers. If you're using the PGI Community edition, you'll need to wait for the 18.4 compilers in April.
Also, putting OpenACC in shared or dynamic libraries require the use of the "-ta=tesla:nordc" option.

Related

Why does a manually programmed matrix multiplication combined with matrix addition give better performance than the intrinsic functions?

I have some legacy code which performs the matrix operation of B = B + A*E as
DO I = 1,N
DO L = 1,N
DO K = 1,N
B(I,K) = B(I,K) + A(I,L)*E(L,K,J-1)
end do
end do
end do
To improve readability as well as take advantage of modern fortran intrinsic functions, I would like to write the above code as
B = B + matmul( A, E(:, 1:N, J-1) )
I noticed that the improved readability comes at the cost of performance. I determined that the problem is not with the intrinsic function matmul - the left figure shows that matmul performs just as well as the manually written operation for all values of N.
When matrix multiplication is combined with matrix addition, then for small values of N the manually written operation performs better than the intrinsic functions. For my uses, usually N < 10; I would like to improve the readability without losing the performance. Might there be a suggestion for that?
The code I am using is below. I am using Mac OS 10.14.6 with gfortran 8.2.0 and compiling with the -O3 optimization option.
program test
implicit none
integer :: loop_max = 1000
integer :: j ! loop index
integer :: i ! loop index
real :: t1, t2 ! start and end times
real :: t_manual, t_intrinsic, t_man_add, t_intrn_add
integer :: N ! matrix dimension
integer, parameter :: NJ = 12
real, dimension(:, :), allocatable :: A, B ! matrices
real, dimension(:, :), allocatable :: D
real, dimension(:), allocatable :: G
real, dimension(:, :, :), allocatable :: E
open(1, file = 'Delete.txt', status = 'unknown')
do N = 1, 40
allocate(A(N,N), B(N,N), G(N), D(N, 2*N+1), E(N, N+1, NJ))
! ##########################################################################
! manual matrix multiplication vs matmul
call rand_fill
call CPU_time(t1)
do i = 1, loop_max
do j = 2, 12
call matmul_manual(j, N, NJ, A, B, D, G, E)
end do
end do
call CPU_time(t2)
t_manual = t2 - t1
write(1, *) A, B, D, G, E
call rand_fill
call CPU_time(t1)
do i = 1, loop_max
do j = 2, 12
B = matmul( A, E(:, 1:N, j-1) )
end do
end do
call CPU_time(t2)
t_intrinsic = t2 - t1
write(1, *) A, B, D, G, E
! --------------------------------------------------------------------------
! ##########################################################################
! manual matrix multiplication with matrix addition
call rand_fill
call CPU_time(t1)
do i = 1, loop_max
do j = 2, 12
call manual_matmul_add(j, N, NJ, A, B, D, G, E)
end do
end do
call CPU_time(t2)
t_man_add = t2 - t1
write(1, *) A, B, D, G, E
! --------------------------------------------------------------------------
! ##########################################################################
! intrinsic matrix multiplication (matmul) with matrix addition
call rand_fill
call CPU_time(t1)
do i = 1, loop_max
do j = 2, 12
call intrinsic_matmul_add(j, N, NJ, A, B, D, G, E)
end do
end do
call CPU_time(t2)
t_intrn_add = t2 - t1
write(1, *) A, B, D, G, E
! --------------------------------------------------------------------------
deallocate(A, B, D, G, E)
print*, N, t_manual, t_intrinsic, t_man_add, t_intrn_add
end do
contains
subroutine rand_fill
! fill the matrices with random numbers
call random_number(A)
call random_number(B)
call random_number(D)
call random_number(G)
call random_number(E)
end subroutine
end program test
subroutine matmul_manual(j, N, NJ, A, B, D, G, E)
implicit none
integer, intent(in) :: j
integer, intent(in) :: N, NJ
real, dimension(N, N), intent(in out) :: A, B
real, dimension(N, 2*N+1), intent(in out) :: D
real, dimension(N), intent(in out) :: G
real, dimension(N, N+1, NJ), intent(in out) :: E
integer :: I, L, K ! loop indices
B = 0.0
DO I = 1,N
DO L = 1,N
DO K = 1,N
B(I,K) = B(I,K) + A(I,L)*E(L,K,J-1)
end do
end do
end do
end subroutine matmul_manual
subroutine manual_matmul_add(j, N, NJ, A, B, D, G, E)
implicit none
integer, intent(in) :: j
integer, intent(in) :: N, NJ
real, dimension(N, N), intent(in out) :: A, B
real, dimension(N, 2*N+1), intent(in out) :: D
real, dimension(N), intent(in out) :: G
real, dimension(N, N+1, NJ), intent(in out) :: E
integer :: I, L, K ! loop indices
DO I = 1,N
D(I,N+1) = -G(I)
DO L = 1,N
D(I,N+1) = D(I,N+1)+A(I,L)*E(L,N+1,J-1)
DO K = 1,N
B(I,K) = B(I,K) + A(I,L)*E(L,K,J-1)
end do
end do
end do
end subroutine manual_matmul_add
subroutine intrinsic_matmul_add(j, N, NJ, A, B, D, G, E)
implicit none
integer, intent(in) :: j
integer, intent(in) :: N, NJ
real, dimension(N, N), intent(in out) :: A, B
real, dimension(N, 2*N+1), intent(in out) :: D
real, dimension(N), intent(in out) :: G
real, dimension(N, N+1, NJ), intent(in out) :: E
real, dimension(N, N+1) :: temp1
real, dimension(N, N) :: temp2
D(:, N+1) = -G + matmul( A, E(:, N+1, j-1) )
B = B + matmul( A, E(:, 1:N, j-1) )
end subroutine intrinsic_matmul_add
subroutine mat_sub_new(j, N, NJ, A, B, D, G, E)
implicit none
integer, intent(in) :: j
integer, intent(in) :: N, NJ
real, dimension(N, N), intent(in out) :: A, B
real, dimension(N, 2*N+1), intent(in out) :: D
real, dimension(N), intent(in out) :: G
real, dimension(N, N+1, NJ), intent(in out) :: E
if (N == 1) then ! matmul seems to be inefficient when N = 1
D(N,N+1) = -G(N) + A(N,N)*E(N, N+1, J-1)
B(N,N) = B(N,N) + A(N,N)*E(N, N, J-1)
else
D(:, N+1) = -G + matmul( A, E(:, N+1, j-1) )
B = B + matmul( A, E(:, 1:N, j-1) )
end if
end subroutine mat_sub_new
I suspect this has to do with two issues:
How the compiler resolves the MATMUL generic interface to a proper call to the correct routine (REAL vs. COMPLEX vs. INTEGER, or matrix times vector vs. matrix times matrix): I have no idea whether this is done systematically at compilation or at runtime, or whether this choice is made based on the optimization level; (this would justify the additional overhead especially for the low-size cases)
The internal "general purpose" algorithm may not be in general the best suited for your problem, as it looks like in some cases, brute-force compiler optimization does a better job. Is gfortran's MATMUL intrinsics based on BLAS, for example? If so, that may not be the fastest.
I've done a similar test with NORM2 on my PC (Windows, i7, gfortran 9.2.0, compiled with -O3 -march=core-avx2): Turns out that, for NORM2:
The BLAS implementation is always slowest, despite having been slightly refactored to feed the compiler a PURE version
Usage of the intrinsics (both NORM2 or SQRT(SUM(X**2))) is always slow, regardless of the array size
The fastest cases are when either using a simple loop, or the intrinsics with a fixed-size array:
ARRAY SIZE assumed-shape fixed-size intrinsic NORM2 LOOP BLAS
N [ms/N] [ms/N] [ms/N] [ms/N] [ms/N]
2 5.93750E-09 4.06250E-09 8.43750E-09 4.06250E-09 1.03125E-08
12 1.03125E-08 7.81250E-09 3.12500E-08 7.81250E-09 5.09375E-08
22 1.65625E-08 1.40625E-08 5.50000E-08 1.43750E-08 9.15625E-08
32 2.25000E-08 2.00000E-08 7.81250E-08 2.00000E-08 1.29375E-07
BTW The code is pasted here below (beware of the large memory footprint!):
program test_norm
use iso_fortran_env
implicit none
integer :: xsize,i,iunit,icase
integer, parameter :: testSize = 50000000
real(real64), allocatable :: set(:,:),setNorm(:)
real(real64) :: t0,t1,setSum(5),timeTable(5,35)
intrinsic :: norm2
open(newunit=iunit,file='garbage.txt',action='write')
print '(6(1x,a15))' ,'ARRAY SIZE','assumed-shape','fixed-size','intrinsic NORM2','LOOP','BLAS'
print '(6(1x,a15))' ,'N',('[ms/N]',i=1,5)
icase = 0
do xsize = 2,32,10
! Initialize test set
icase = icase+1
allocate(set(xsize,testSize),setNorm(testSize))
call random_number(set)
! Test 1: intrinsic SQRT/SUM, assumed-shape array
call cpu_time(t0); forall(i=1:testSize) setNorm(i) = norm_v1(set(:,i)); call cpu_time(t1)
setSum(1) = sum(setNorm); timeTable(1,icase) = t1-t0
! Test 2: intrinsic SQRT/SUM, fixed-size array
call cpu_time(t0); forall(i=1:testSize) setNorm(i) = norm_v2(xsize,set(:,i)); call cpu_time(t1)
setSum(2) = sum(setNorm); timeTable(2,icase) = t1-t0
! Test 3: intrinsic NORM2
call cpu_time(t0); forall(i=1:testSize) setNorm(i) = norm2(set(:,i)); call cpu_time(t1)
setSum(3) = sum(setNorm); timeTable(3,icase) = t1-t0
! Test 4: LOOP
call cpu_time(t0); forall(i=1:testSize) setNorm(i) = norm_v3(xsize,set(:,i)); call cpu_time(t1)
setSum(4) = sum(setNorm); timeTable(4,icase) = t1-t0
! Test 5: BLAS
call cpu_time(t0); forall(i=1:testSize) setNorm(i) = DNRM2(xsize,set(:,i),1); call cpu_time(t1)
setSum(5) = sum(setNorm); timeTable(5,icase) = t1-t0
! Print output
print '(7x,i2,9x,5(2x,1pe13.5e2,1x))', xsize,timeTable(:,icase)/testSize
write (iunit,*) 'setSum = ',setSum
deallocate(set,setNorm)
end do
close(iunit)
contains
pure real(real64) function norm_v1(x) result(L2)
real(real64), intent(in) :: x(:)
L2 = sqrt(sum(x**2))
end function norm_v1
pure real(real64) function norm_v2(n,x) result(L2)
integer, intent(in) :: n
real(real64), intent(in) :: x(n)
L2 = sqrt(sum(x**2))
end function norm_v2
pure real(real64) function norm_v3(n,x) result(L2)
integer, intent(in) :: n
real(real64), intent(in) :: x(n)
integer :: i
L2 = 0.0_real64
do i=1,n
L2 = L2 + x(i)**2
end do
L2 = sqrt(L2)
end function norm_v3
PURE REAL(REAL64) FUNCTION DNRM2 ( N, X, INCX )
INTEGER, INTENT(IN) :: N,INCX
REAL(REAL64), INTENT(IN) :: X( * )
REAL(REAL64), PARAMETER :: ONE = 1.0_REAL64
REAL(REAL64), PARAMETER :: ZERO = 0.0_REAL64
INTEGER :: IX
REAL(REAL64) :: ABSXI, NORM, SCALE, SSQ
INTRINSIC :: ABS, SQRT
IF( N<1 .OR. INCX<1 )THEN
NORM = ZERO
ELSE IF( N==1 )THEN
NORM = ABS( X( 1 ) )
ELSE
SCALE = ZERO
SSQ = ONE
DO IX = 1, 1 + ( N - 1 )*INCX, INCX
IF( X( IX )/=ZERO )THEN
ABSXI = ABS( X( IX ) )
IF( SCALE<ABSXI )THEN
SSQ = ONE + SSQ*( SCALE/ABSXI )**2
SCALE = ABSXI
ELSE
SSQ = SSQ + ( ABSXI/SCALE )**2
END IF
END IF
END DO
NORM = SCALE * SQRT( SSQ )
END IF
DNRM2 = NORM
RETURN
END FUNCTION DNRM2
end program test_norm

Storing a Variable with a Multi-Dimensional Index in Fortran

Question
Consider the following code:
program example
implicit none
integer, parameter :: n_coeffs = 1000
integer, parameter :: n_indices = 5
integer :: i
real(8), dimension(n_coeffs) :: coeff
integer, dimension(n_coeffs,n_indices) :: index
do i = 1, n_coeffs
coeff(i) = real(i*3,8)
index(i,:) = [2,4,8,16,32]*i
end do
end
For any 5 dimensional index I need to obtain the associated coefficient, without knowing or calculating i. For instance, given [2,4,8,16,32] I need to obtain 3.0 without computing i.
Is there a reasonable solution, perhaps using sparse matrices, that would work for n_indices in the order of 100 (though n_coeffs still in the order of 1000)?
A Bad Solution
One solution would be to define a 5 dimensional array as in
real(8), dimension(2000,4000,8000,16000,32000) :: coeff2
do i = 1, ncoeffs
coeff2(index(i,1),index(i,2),index(i,3),index(i,4),index(i,5)) = coeff(i)
end do
then, to get the coefficient associated with [2,4,8,16,32], call
coeff2(2,4,8,16,32)
However, besides being very wasteful of memory, this solution would not allow n_indices to be set to a number higher than 7 given the limit of 7 dimensions to an array.
OBS: This question is a spin-off of this one. I have tried to ask the question more precisely having failed in the first attempt, an effort that greatly benefited from the answer of #Rodrigo_Rodrigues.
Actual Code
In case it helps here is the code for the actual problem I am trying to solve. It is an adaptive sparse grid method for approximating a function. The main goal is to make the interpolation at the and as fast as possible:
MODULE MOD_PARAMETERS
IMPLICIT NONE
SAVE
INTEGER, PARAMETER :: d = 2 ! number of dimensions
INTEGER, PARAMETER :: L_0 = 4 ! after this adaptive grid kicks in, for L <= L_0 usual sparse grid
INTEGER, PARAMETER :: L_max = 9 ! maximum level
INTEGER, PARAMETER :: bound = 0 ! 0 -> for f = 0 at boundary
! 1 -> adding grid points at boundary
! 2 -> extrapolating close to boundary
INTEGER, PARAMETER :: max_error = 1
INTEGER, PARAMETER :: L2_error = 1
INTEGER, PARAMETER :: testing_sample = 1000000
REAL(8), PARAMETER :: eps = 0.01D0 ! epsilon for adaptive grid
END MODULE MOD_PARAMETERS
PROGRAM MAIN
USE MOD_PARAMETERS
IMPLICIT NONE
INTEGER, DIMENSION(d,d) :: ident
REAL(8), DIMENSION(d) :: xd
INTEGER, DIMENSION(2*d) :: temp
INTEGER, DIMENSION(:,:), ALLOCATABLE :: grid_index, temp_grid_index, grid_index_new, J_index
REAL(8), DIMENSION(:), ALLOCATABLE :: coeff, temp_coeff, J_coeff
REAL(8) :: temp_min, temp_max, V, T, B, F, x1
INTEGER :: k, k_1, k_2, h, i, j, L, n, dd, L1, L2, dsize, count, first, repeated, add, ind
INTEGER :: time1, time2, clock_rate, clock_max
REAL(8), DIMENSION(L_max,L_max,2**(L_max),2**(L_max)) :: coeff_grid
INTEGER, DIMENSION(d) :: level, LL, ii
REAL(8), DIMENSION(testing_sample,d) :: x_rand
REAL(8), DIMENSION(testing_sample) :: interp1, interp2
! ============================================================================
! EXECUTABLE
! ============================================================================
ident = 0
DO i = 1,d
ident(i,i) = 1
ENDDO
! Initial grid point
dsize = 1
ALLOCATE(grid_index(dsize,2*d),grid_index_new(dsize,2*d))
grid_index(1,:) = 1
grid_index_new = grid_index
ALLOCATE(coeff(dsize))
xd = (/ 0.5D0, 0.5D0 /)
CALL FF(xd,coeff(1))
CALL FF(xd,coeff_grid(1,1,1,1))
L = 1
n = SIZE(grid_index_new,1)
ALLOCATE(J_index(n*2*d,2*d))
ALLOCATE(J_coeff(n*2*d))
CALL SYSTEM_CLOCK (time1,clock_rate,clock_max)
DO WHILE (L .LT. L_max)
L = L+1
n = SIZE(grid_index_new,1)
count = 0
first = 1
DEALLOCATE(J_index,J_coeff)
ALLOCATE(J_index(n*2*d,2*d))
ALLOCATE(J_coeff(n*2*d))
J_index = 0
J_coeff = 0.0D0
DO k = 1,n
DO i = 1,d
DO j = 1,2
IF ((bound .EQ. 0) .OR. (bound .EQ. 2)) THEN
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(grid_index_new(k,d+i)-(-1)**j)/)
ELSEIF (bound .EQ. 1) THEN
IF (grid_index_new(k,i) .EQ. 1) THEN
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(-(-1)**j)/)
ELSE
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(grid_index_new(k,d+i)-(-1)**j)/)
ENDIF
ENDIF
CALL XX(d,temp(1:d),temp(d+1:2*d),xd)
temp_min = MINVAL(xd)
temp_max = MAXVAL(xd)
IF ((temp_min .GE. 0.0D0) .AND. (temp_max .LE. 1.0D0)) THEN
IF (first .EQ. 1) THEN
first = 0
count = count+1
J_index(count,:) = temp
V = 0.0D0
DO k_1 = 1,SIZE(grid_index,1)
T = 1.0D0
DO k_2 = 1,d
CALL XX(1,temp(k_2),temp(d+k_2),x1)
CALL BASE(x1,grid_index(k_1,k_2),grid_index(k_1,k_2+d),B)
T = T*B
ENDDO
V = V+coeff(k_1)*T
ENDDO
CALL FF(xd,F)
J_coeff(count) = F-V
ELSE
repeated = 0
DO h = 1,count
IF (SUM(ABS(J_index(h,:)-temp)) .EQ. 0) THEN
repeated = 1
ENDIF
ENDDO
IF (repeated .EQ. 0) THEN
count = count+1
J_index(count,:) = temp
V = 0.0D0
DO k_1 = 1,SIZE(grid_index,1)
T = 1.0D0
DO k_2 = 1,d
CALL XX(1,temp(k_2),temp(d+k_2),x1)
CALL BASE(x1,grid_index(k_1,k_2),grid_index(k_1,k_2+d),B)
T = T*B
ENDDO
V = V+coeff(k_1)*T
ENDDO
CALL FF(xd,F)
J_coeff(count) = F-V
ENDIF
ENDIF
ENDIF
ENDDO
ENDDO
ENDDO
ALLOCATE(temp_grid_index(dsize,2*d))
ALLOCATE(temp_coeff(dsize))
temp_grid_index = grid_index
temp_coeff = coeff
DEALLOCATE(grid_index,coeff)
ALLOCATE(grid_index(dsize+count,2*d))
ALLOCATE(coeff(dsize+count))
grid_index(1:dsize,:) = temp_grid_index
coeff(1:dsize) = temp_coeff
DEALLOCATE(temp_grid_index,temp_coeff)
grid_index(dsize+1:dsize+count,:) = J_index(1:count,:)
coeff(dsize+1:dsize+count) = J_coeff(1:count)
dsize = dsize + count
DO i = 1,count
coeff_grid(J_index(i,1),J_index(i,2),J_index(i,3),J_index(i,4)) = J_coeff(i)
ENDDO
IF (L .LE. L_0) THEN
DEALLOCATE(grid_index_new)
ALLOCATE(grid_index_new(count,2*d))
grid_index_new = J_index(1:count,:)
ELSE
add = 0
DO h = 1,count
IF (ABS(J_coeff(h)) .GT. eps) THEN
add = add + 1
J_index(add,:) = J_index(h,:)
ENDIF
ENDDO
DEALLOCATE(grid_index_new)
ALLOCATE(grid_index_new(add,2*d))
grid_index_new = J_index(1:add,:)
ENDIF
ENDDO
CALL SYSTEM_CLOCK (time2,clock_rate,clock_max)
PRINT *, 'Elapsed real time1 = ', DBLE(time2-time1)/DBLE(clock_rate)
PRINT *, 'Grid Points = ', SIZE(grid_index,1)
! ============================================================================
! Compute interpolated values:
! ============================================================================
CALL RANDOM_NUMBER(x_rand)
CALL SYSTEM_CLOCK (time1,clock_rate,clock_max)
DO i = 1,testing_sample
V = 0.0D0
DO L1=1,L_max
DO L2=1,L_max
IF (L1+L2 .LE. L_max+1) THEN
level = (/L1,L2/)
T = 1.0D0
DO dd = 1,d
T = T*(1.0D0-ABS(x_rand(i,dd)/2.0D0**(-DBLE(level(dd)))-DBLE(2*FLOOR(x_rand(i,dd)*2.0D0**DBLE(level(dd)-1))+1)))
ENDDO
V = V + coeff_grid(L1,L2,2*FLOOR(x_rand(i,1)*2.0D0**DBLE(L1-1))+1,2*FLOOR(x_rand(i,2)*2.0D0**DBLE(L2-1))+1)*T
ENDIF
ENDDO
ENDDO
interp2(i) = V
ENDDO
CALL SYSTEM_CLOCK (time2,clock_rate,clock_max)
PRINT *, 'Elapsed real time2 = ', DBLE(time2-time1)/DBLE(clock_rate)
END PROGRAM
For any 5 dimensional index I need to obtain the associated
coefficient, without knowing or calculating i. For instance, given
[2,4,8,16,32] I need to obtain 3.0 without computing i.
function findloc_vector(matrix, vector) result(out)
integer, intent(in) :: matrix(:, :)
integer, intent(in) :: vector(size(matrix, dim=2))
integer :: out, i
do i = 1, size(matrix, dim=1)
if (all(matrix(i, :) == vector)) then
out = i
return
end if
end do
stop "No match for this vector"
end
And that's how you use it:
print*, coeff(findloc_vector(index, [2,4,8,16,32])) ! outputs 3.0
I must confess I was reluctant to post this code because, even though this answers your question, I honestly think this is not what you really want/need, but you dind't provide enough information for me to know what you really do want/need.
Edit (After actual code from OP):
If I decrypted your code correctly (and considering what you said in your previous question), you are declaring:
REAL(8), DIMENSION(L_max,L_max,2**(L_max),2**(L_max)) :: coeff_grid
(where L_max = 9, so size(coeff_grid) = 21233664 =~160MB) and then populating it with:
DO i = 1,count
coeff_grid(J_index(i,1),J_index(i,2),J_index(i,3),J_index(i,4)) = J_coeff(i)
ENDDO
(where count is of the order of 1000, i.e. 0.005% of its elements), so this way you can fetch the values by its 4 indices with the array notation.
Please, don't do that. You don't need a sparse matrix in this case either. The new approach you proposed is much better: storing the indices in each row of an smaller array, and fetching on the array of coefficients by the corresponding location of those indices in its own array. This is way faster (avoiding the large allocation) and much more memory-efficient.
PS: Is it mandatory for you to stick to Fortran 90? Its a very old version of the standard and chances are that the compiler you're using implements a more recent version. You could improve the quality of your code a lot with the intrinsic move_alloc (for less array copies), the kind constants from the intrinsic module iso_fortran_env (for portability), the [], >, <, <=,... notation (for readability)...

Fortran's MKL dgemm is giving zero result

In the following Fortran program I use Intel's MKL library to perform matrix multiplications using dgemm. Initially, I used the matmul subroutine and got correct results. When I translated matmul to dgemm in the loop below, I got all zero vectors instead of the correct output. I appreciate your help.
program spectral_norm
implicit none
!
integer, parameter :: n = 5500, dp = kind(0.0d0)
real(dp), allocatable :: A(:, :), u(:), v(:), Au(:), Av(:)
integer :: i, j
allocate(u(n), v(n), A(n, n), Au(n), Av(n))
do j = 1, n
do i = 1, n
A(i, j) = Ac(i, j)
end do
end do
u = 1
do i = 1, 10
call dgemm('N','N', n, 1, n, 1.0, A, n, u, n, 0.0, Au, n)
call dgemm('N','N', n, 1, n, 1.0, Au, n, A, n, 0.0, v, n)
call dgemm('N','N', n, 1, n, 1.0, A, n, v, n, 0.0, Av, n)
call dgemm('N','N', n, 1, n, 1.0, Av, n, A, n, 0.0, u, n)
!v = matmul(matmul(A, u), A)
!u = matmul(matmul(A, v), A)
end do
write(*, "(f0.9)") sqrt(dot_product(u, v) / dot_product(v, v))
contains
pure real(dp) function Ac(i, j) result(r)
integer, intent(in) :: i, j
r = 1._dp / ((i+j-2) * (i+j-1)/2 + i)
end function
end program spectral_norm
This gives NaN, while the correct output from matmul is 1.274224153.
Well, thank you all for your suggestions. I think I figured out the source of the error. The order of multiplication was reversed in two cases, it should have been A * Au and A * Av instead. This is because A has order n x n and both Au ans Av have order n x 1. So, we can't multiply Au * A or Av * A due to dimensions mismatch. I posted the corrected version below.
program spectral_norm
implicit none
!
integer, parameter :: n = 5500, dp = kind(0.d0)
real(dp), allocatable :: A(:,:), u(:), v(:), Au(:), Av(:)
integer :: i, j
allocate(u(n), v(n), A(n, n), Au(n), Av(n))
do j = 1, n
do i = 1, n
A(i, j) = Ac(i, j)
end do
end do
u = 1
do i = 1, 10
call dgemm('N', 'N', n, 1, n, 1._dp, A, n, u, n, 0._dp, Au, n)
call dgemm('T', 'N', n, 1, n, 1._dp, A, n, Au, n, 0._dp, v, n)
call dgemm('N', 'N', n, 1, n, 1._dp, A, n, v, n, 0._dp, Av, n)
call dgemm('T', 'N', n, 1, n, 1._dp, A, n, Av, n, 0._dp, u, n)
end do
write(*, "(f0.9)") sqrt(dot_product(u, v) / dot_product(v, v))
contains
pure real(dp) function Ac(i, j) result(r)
integer, intent(in) :: i, j
r = 1._dp / ((i+j-2) * (i+j-1)/2 + i)
end function
end program spectral_norm
This gives the correct results:
1.274224153
Elapsed time 0.5150000 seconds

Eigenvector discontinuities in hermitian matrix diagonalization

I need to diagonalize a 2x2 Hermitian matrix that depends on a parameter x, which varies continuously. For diagonalization I use EISPACK. When I plot the real and imaginary components of eigenvectors as a function of x, I notice that they have discontinuities. The eigenvalues calculation is OK. When I plot the eigenvectors in Maxima, the solutions appear continuous. I need the continuous eigenvectors since in next step I will need to calculate their derivatives.
Below the f77 code I use as test (compiling with gfortran on mingw).
program Eigenvalue
implicit none
integer n, m
parameter (n=2)
integer ierr, matz, i, j
double precision x, dx, xf, amp, xin
double precision w(n)
double precision Ar(n,n), Ai(n,n)
double precision xr(n,n), xi(n,n)
double precision fm1(2,n) ! f77
double precision fv1(n) ! f77
double precision fv2(n) ! f77
double complex psi1a, psi1b, psi2a, psi2b
m = 51
xf = 10.d0
xin = 0.0d0
amp = 2.d0
dx = (xf - xin)/(m-1)
do i = 1, m
x = dx*(i-m) + xf
Ar(1,1) = dsin(x)**2
Ar(1,2) = amp*dcos(x)
Ar(2,1) = amp*dcos(x)
Ar(2,2) = dcos(x)**2
Ai(1,1) = 0.0d0
Ai(1,2) = amp*dsin(x)
Ai(2,1) = -amp*dsin(x)
Ai(2,2) = 0.0d0
matz = 1
call ch ( n, n, ar, ai, w, matz, xr, xi, fv1, fv2, fm1, ierr ) !f77
write(20,*) x, w(1), w(2)
write(21,*) x, xr(1,1), xi(1,1)
write(22,*) x, xr(2,1), xi(2,1)
write(23,*) x, xr(1,2), xi(1,2)
write(24,*) x, xr(2,2), xi(2,2)
! autovetor 1
psi1a = cmplx(xr(1,1),xi(1,1))
psi1b = cmplx(xr(1,2),xi(1,2))
! autovetor 2
psi2a = cmplx(xr(2,1),xi(2,1))
psi2b = cmplx(xr(2,2),xi(2,2))
end do
end
While not really an answer, what follows is the code I used with LAPACK.
I used the latest versions of LAPACK and BLAS, with the following compiler options:
gfortran -Og -std=f2008 -Wall -Wextra {location_of_lapack}/liblapack.a {location_of_blas}/blas_LINUX.a main.f90 -o main
I'm compiling on Mac OS X with gfortran 6.3.0 from homebrew.
As Ian mentioned above, things like dcos are replaced with cos and I have used the KIND= formulation to ensure the same precision.
Ian also mentioned above about the arbitrary phase.
This problem is answered here; I have translated this solution into my code below.
The "magic" happens after the call to ZHEEV.
With this fix, I see no discontinuities.
program Eigenvalue
!> This can be used with the f2008 call
use, intrinsic :: iso_fortran_env
implicit none
!> dp contains the kind value for double precision.
!> Use below if compiling to f2008
integer, parameter :: dp = REAL64
!> Use below if compiling with f95 up and comment out iso_fortran_env
!>integer, parameter :: dp = SELECTED_REAL_KIND(15, 300)
!> Set wp to the desired precision.
integer, parameter :: wp = dp
integer, parameter :: n = 2
integer :: i, j, k, m
real(kind=wp) x, dx, xf, amp, xin
real(kind=wp), dimension(n) :: w
real(kind=wp), dimension(n, n) :: Ar, Ai
complex(kind=wp), dimension(n, n) :: A
complex(kind=wp), dimension(max(1,2*n-1)) :: WORK
integer, parameter :: lwork = max(1,2*n-1)
real(kind=wp), dimension(max(1, 3*n-2)) :: RWORK
integer :: info
complex(kind=wp) :: psi1a, psi1b, psi2a, psi2b
real(kind=wp) :: mag
m = 51
xf = 10.0_wp
xin = 0.0_wp
amp = 2.0_wp
if (m .eq. 1) then
dx = 0.0_wp
else
dx = (xf - xin)/(m-1)
end if
do i = 1, m
x = dx*(i-m) + xf
Ar(1,1) = sin(x)**2
Ar(1,2) = amp*cos(x)
Ar(2,1) = amp*cos(x)
Ar(2,2) = cos(x)**2
Ai(1,1) = 0.0_wp
Ai(1,2) = amp*sin(x)
Ai(2,1) = -amp*sin(x)
Ai(2,2) = 0.0_wp
do j = 1, n
do k = 1, n
A(j, k) = cmplx(Ar(j, k), Ai(j, k), kind=wp)
end do
end do
call ZHEEV('V', 'U', N, A, N, W, WORK, LWORK, RWORK, INFO)
do j = 1, n
A(:, j) = A(:, j) / A(1, j)
mag = sqrt(real(A(1, j)*conjg(A(1, j)))+ real(A(2, j)*conjg(A(2, j))))
A(:, j) = A(:, j)/mag
end do
psi1a = A(1, 1)
psi1b = A(1, 2)
psi2a = A(2, 1)
psi2b = A(2, 2)
end do
end program Eigenvalue

Shooting method in fortran (neutron star oscillation)

I have been writing a script in fortran 90 for solving the radial oscillation problem of a neutron star with the use of shooting method. But for unknown reason, my program never works out. Without the shooting method component, the program runs smoothly as it successfully constructed the star. But once the shooting comes in, everything dies.
PROGRAM ROSCILLATION2
USE eos_parameters
IMPLICIT NONE
INTEGER ::i, j, k, l
INTEGER, PARAMETER :: N_ode = 5
REAL, DIMENSION(N_ode) :: y
REAL(8) :: rho0_cgs, rho0, P0, r0, phi0, pi
REAL(8) :: r, rend, mass, P, phi, delta, xi, eta
REAL(8) :: step, omega, omegastep, tiny, rho_print, Radius, B, a2, s0, lamda, E0, E
EXTERNAL :: fcn
!!!! User input
rho0_cgs = 2.D+15 !central density in cgs unit
step = 1.D-4 ! step size dr
omegastep = 1.D-2 ! step size d(omega)
tiny = 1.D-8 ! small number P(R)/P(0) to define star surface
!!!!!!!!!
open(unit=15, file="data.dat", status="new")
pi = ACOS(-1.D0)
a2 =((((1.6022D-13)**4)*(6.674D-11)*((2.997D8)**-7)*((1.0546D-34)**-3)*(1.D6))**(0.5D0))*a2_MeV !convert to code unit (km^-1)
B = ((1.6022D-13)**4)*(6.674D-11)*((2.997D8)**-7)*((1.0546D-34)**-3)*(1.D6)*B_MeV !convert to code unit (km^-2)
s0 = (1.D0/3.D0) - (1/(6*pi**2))*a2*((1/(16*pi**2)*a2**2 + (pi**-2)*a4*(rho0 - B))**-0.5) !square of the spped of sound at r=0
lamda = -0.5D0*log(1-2*y(1)/r)
E0 = (r0**-2)*s0*exp(lamda + 3*phi0)
rho0 = rho0_cgs*6.67D-18 / 9.D0 !convert rho0 to code unit (km^-2)
!! Calculate central pressure P0
P0 = (1.D0/3.D0)*rho0 - (4.D0/3.D0)*B - (1.D0/(a4*(12.D0)*(pi**2)))*a2**2 - &
&(a2/((3.D0)*a4))*(((1.D0/(16.D0*pi**4))*a2**2+(1.D0/(pi**2))*a4*(rho0-B))**0.5D0)
!! initial value for metric function phi
phi0 = 0.1D0 ! arbitrary (needed to be adjusted later)
r0 = 1.D-30 ! integration starting point
!! Set initial conditions
!!!!!!!!!!!!!!!!!
!!Start integration loop
!!!!!!!!!!!!!!!!!
r = r0
y(1) = 0.D0
y(2) = P0
y(3) = phi0
y(4) = 1/(3*E0)
y(5) = 1
omega = 2*pi*1000/(2.997D5) !omega of 1kHz in code unit
DO l = 1, 1000
omega = omega + omegastep !shooting method part
DO i = 1, 1000000000
rend = r0 + REAL(i)*step
call oderk(r,rend,y,N_ode,fcn)
r = rend
mass = y(1)
P = y(2)
phi = y(3)
xi = y(4)
eta = y(5)
IF (P < tiny*P0) THEN
WRITE(*,*) "Central density (10^14 cgs) = ", rho0_cgs/1.D14
WRITE(*,*) " Mass (solar mass) = ", mass/1.477D0
WRITE(*,*) " Radius (km) = ", r
WRITE(*,*) " Compactness M/R ", mass/r
WRITE(15,*) (omega*2.997D5/(2*pi)), y(5)
GOTO 21
ENDIF
ENDDO
ENDDO
21 CONTINUE
END PROGRAM roscillation2
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
SUBROUTINE fcn(r,y,yprime)
USE eos_parameters
IMPLICIT NONE
REAL(8), DIMENSION(5) :: y, yprime
REAL(8) :: r, m, P, phi, rho, pi, B, a2, xi, eta, W, Q, E, s, lamda, omega
INTEGER :: j
pi = ACOS(-1.D0)
a2 =((((1.6022D-13)**4)*(6.674D-11)*((2.997D8)**-7)*((1.0546D-34)**-3)*(1.D6))**(0.5D0))*a2_MeV !convert to code unit (km^-1)
B = ((1.6022D-13)**4)*(6.674D-11)*((2.997D8)**-7)*((1.0546D-34)**-3)*(1.D6)*B_MeV !convert to code unit (km^-2)
m = y(1)
P = y(2)
phi = y(3)
xi = y(4)
eta = y(5)
rho = 3.D0*P + 4.D0*B +((3.D0)/(4.D0*a4*(pi**2)))*a2**2+(a2/a4)*&
&(((9.D0/((16.D0)*(pi**4)))*a2**2+((3.D0/(pi**2))*a4*(P+B)))**0.5D0)
s = (1.D0/3.D0) - (1/(6*pi**2))*a2*((1/(16*pi**2)*a2**2 + (pi**-2)*a4*(rho - B))**-0.5) !square of speed of sound
W = (r**-2)*(rho + P)*exp(3*lamda + phi)
E = (r**-2)*s*exp(lamda + 3*phi)
Q = (r**-2)*exp(lamda + 3*phi)*(rho + P)*((yprime(3)**2) + 4*(r**-1)*yprime(3)- 8*pi*P*exp(2*lamda))
yprime(1) = 4.D0*pi*rho*r**2
yprime(2) = - (rho + P)*(m + 4.D0*pi*P*r**3)/(r*(r-2.D0*m))
yprime(3) = (m + 4.D0*pi*P*r**3)/(r*(r-2.D0*m))
yprime(4) = y(5)/(3*E)
yprime(5) = -(W*omega**2 + Q)*y(4)
END SUBROUTINE fcn
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!
!! Runge-Kutta method (from Numerical Recipes)
!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
subroutine oderk(ri,re,y,n,derivs)
INTEGER, PARAMETER :: NMAX=16
REAL(8) :: ri, re, step
REAL(8), DIMENSION(NMAX) :: y, dydx, yout
EXTERNAL :: derivs,rk4
call derivs(ri,y,dydx)
step=re-ri
CALL rk4(y,dydx,n,ri,step,yout,derivs)
do i=1,n
y(i)=yout(i)
enddo
return
end subroutine oderk
SUBROUTINE RK4(Y,DYDX,N,X,H,YOUT,DERIVS)
INTEGER, PARAMETER :: NMAX=16
REAL(8) :: H,HH,XH,X,H6
REAL(8), DIMENSION(N) :: Y, DYDX, YOUT
REAL(8), DIMENSION(NMAX) :: YT, DYT, DYM
EXTERNAL :: derivs
HH=H*0.5D0
H6=H/6D0
XH=X+HH
DO I=1,N
YT(I)=Y(I)+HH*DYDX(I)
ENDDO
CALL DERIVS(XH,YT,DYT)
DO I=1,N
YT(I)=Y(I)+HH*DYT(I)
ENDDO
CALL DERIVS(XH,YT,DYM)
DO I=1,N
YT(I)=Y(I)+H*DYM(I)
DYM(I)=DYT(I)+DYM(I)
ENDDO
CALL DERIVS(X+H,YT,DYT)
DO I=1,N
YOUT(I)=Y(I)+H6*(DYDX(I)+DYT(I)+2*DYM(I))
ENDDO
END SUBROUTINE RK4
Any reply would be great i am just really depressed for the long debugging.
Your program is blowing up because of this line:
yprime(5) = -(W*omega**2 + Q)*y(4)
in subroutine fcn. In this subroutine, omega is completely independent of the one declared in your main program. This one is uninitialized and used in an expression, which will either contain random values or zero, if your compiler is nice enough (or told) to initialize variables.
If you want the variable omega from your main program to be the same variable you use in fcn then you need to pass that variable to fcn somehow. Due to the way you've architected this program, passing it would require modifying all of your procedures to pass omega so that it can be provided to all of your calls to DERIVS (which is the dummy argument you are associating with fcn).
An alternative would be to put omega into a module and use that module where you need access to omega, e.g. declare it in eos_parameters instead of declaring it in the scoping units of fcn and your main program.