Readability-wise, I find it preferable to write
momentum = sum( [( calculateMomentum(elements(i)), i=1, size(elements,1) )] )
over
momentum = 0.0d0
do i = 1, size(elements,1)
momentum = momentum + calculateMomentum(elements(i))
end do
because the first version reads as a definition of the value of momentum, while the second is a lower-level accumulation instruction. The difference becomes more pronounced in real-world code with more arguments and possibly multiple indices.
However, the first version allocates a temporary array. As a human programmer I know that it could be optimized away, so I was wondering whether Fortran offers a syntax that allows calculating the sum with neither an explicit loop nor a temporary array.
Update
No such construct. It looks like there is no syntax like the one I was asking for. See Vladimir F's answer.
It matters, but less than I thought. I've made my own benchmark (pastebin, embedded) based on matrix multiplication, comparing several variants.
(1) C(i,j) = C(i,j) + A(i,k) * B(k,j) was the slowest, probably due to the unnecessary array access in each step.
(2) C(i,j) = sum( [(A(i,k) * B(k,j), k = 1, N)] ) was about 10-20% faster than (1), despite the temporary array.
(3) tmp = tmp + A(i,k) * B(k,j), i.e. using a temporary accumulator variable, was about 20% faster than (2).
(4) C = matmul(A,B) was the fastest by far, beating (3) by a factor of 25 for 500x500 matrices, growing to 50 for 2000x2000, while the relative speed of the other variants stayed roughly the same.
Bottom line: When the task cannot easily be expressed with optimized library or intrinsic functions, the sum variant has viable performance, and it should only be replaced by an explicit loop if performance really matters to that degree in that part of the code.
Whether a temporary array will be allocated or not depends on the optimizations in the compiler. Stack allocation is almost free anyway. Copying the values will probably take longer.
The compiler may optimize unnecessary steps away if it can make sure the result will be the same. However, there is no special syntax for that. Fortran typically tries to stay far from the actual implementation and leaves a lot to the compiler.
As an experiment, I've tried this code (which computes the sum of the inverses of the elements of arr).
program main
use iso_fortran_env, only: dp => real64
implicit none
real(dp) val
real(dp), allocatable :: arr(:)
integer num, loop, i, t1, t2, trate
num = 10**8
arr = [( i, i = 1, num )] !! L1
do loop = 1, 10
call system_clock( t1 )
val = sum( [( testfunc( arr(i) ), i = 1, num )] ) !! L2
call system_clock( t2, trate )
print *, "val = ", val, " in ", (t2 - t1) / real(trate), " (s)"
enddo
contains
function testfunc( x ) result( ret )
real(dp), intent(in) :: x
real(dp) :: ret
ret = 1.0_dp / x
end
end program
Then, on my computer (mac2012), "gfortran-10 -O2 test.f90 && time ./a.out" gives
val = 18.997896413852555 in 1.02999997 (s)
val = 18.997896413852555 in 1.10099995 (s)
val = 18.997896413852555 in 1.17600000 (s)
...
real 0m12.575s
user 0m8.142s
sys 0m4.387s
and "gfortran-10 -O3" gives
val = 18.997896413852555 in 0.875000000 (s)
val = 18.997896413852555 in 0.888000011 (s)
val = 18.997896413852555 in 0.833000004 (s)
...
real 0m9.986s
user 0m5.738s
sys 0m4.210s
In both cases, the htop command shows ~1.5 GB allocated, which may be reasonable if lines L1 and L2 use a temporary array (each ~800 MB with ~0.3 s for allocation).
Because there is no syntax for creating "iterators", I've tried making testfunc() elemental (or impure elemental). The only differences are the lines marked with <--.
program main
use iso_fortran_env, only: dp => real64
implicit none
real(dp) val
real(dp), allocatable :: arr(:)
integer num, loop, i, t1, t2, trate
num = 10**8
arr = [( i, i = 1, num )]
do loop = 1, 10
call system_clock( t1 )
val = sum( testfunc( arr ) ) !<--
call system_clock( t2, trate )
print *, "val = ", val, " in ", (t2 - t1) / real(trate), " (s)"
enddo
contains
impure elemental & !<--
function testfunc( x ) result( ret )
real(dp), intent(in) :: x
real(dp) :: ret
ret = 1.0_dp / x
end
end program
Then, "gfortran-10 -O2" gives
val = 18.997896413852555 in 0.437000006 (s)
val = 18.997896413852555 in 0.453999996 (s)
val = 18.997896413852555 in 0.437999994 (s)
...
real 0m5.946s
user 0m5.069s
sys 0m0.842s
and "gfortran-10 -O3" gives
val = 18.997896413852555 in 0.225999996 (s)
val = 18.997896413852555 in 0.252000004 (s)
val = 18.997896413852555 in 0.246999994 (s)
...
real 0m3.909s
user 0m3.009s
sys 0m0.867s
The htop command shows ~800 MB, so it seems only arr is allocated.
For comparison, the following code calculates val with an explicit do-loop (using a scalar version of testfunc())
val = 0
do i = 1, num
val = val + testfunc( arr(i) )
enddo
which gave the same timing as the second code with elemental + testfunc(arr) above (on my mac).
In all the above code, the -Ofast option resulted in a runtime error (Illegal instruction). But this was due to the line L1 (arr = [( i, i = 1, num )]). If I allocate arr beforehand and populate it with an explicit loop, -Ofast also works without problem (giving almost the same timing as -O3 in this case).
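The workaround described above (allocating arr beforehand and filling it with an explicit loop instead of an array constructor) might look like the following minimal sketch. The module name ramp_mod and the subroutine name fill_ramp are made up for illustration and are not part of the original test program.

```fortran
module ramp_mod
   use iso_fortran_env, only: dp => real64
   implicit none
contains
   ! Hypothetical helper: allocate arr with num elements and set arr(i) = i
   ! with an explicit loop, so no array-constructor temporary is involved.
   subroutine fill_ramp( arr, num )
      real(dp), allocatable, intent(out) :: arr(:)
      integer, intent(in) :: num
      integer :: i
      allocate( arr(num) )
      do i = 1, num
         arr(i) = real( i, dp )
      enddo
   end subroutine fill_ramp
end module ramp_mod
```

In the test program, call fill_ramp( arr, num ) would then take the place of the line marked !! L1.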
If you make calculateMomentum an elemental function, then it can be used for both scalar values and arrays,
for example with:
elemental function calculateMomentum(obj, v) result(p)
class(body), intent(in) :: obj
real, intent(in) :: v
real :: p
p = obj%mass * v
end function
You can then apply it to an array v:
integer::i
type(body) :: ball
real, allocatable :: v(:), p(:)
real :: tot_p
allocate(v(10))
v = [ (10+i, i=1, 10) ]
p = calculateMomentum(ball, v)
tot_p = sum(p)
Having an intermediate array to hold the values is advantageous because it keeps the data close by (probably within the cache line), so the sum() function should be as quick as it can be.
Imagine the type body as follows, for example:
type body
real :: mass
end type
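Putting the pieces of this answer together, a minimal sketch could look like this. The module name body_mod and the wrapper total_momentum are assumptions added for illustration; the type and the elemental function are the ones shown above. The type has to be defined before the function that uses it, which is why both live in one module here.

```fortran
module body_mod
   implicit none
   type body
      real :: mass
   end type
contains
   ! Elemental: accepts a scalar v or an array v of any shape
   elemental function calculateMomentum(obj, v) result(p)
      class(body), intent(in) :: obj
      real, intent(in) :: v
      real :: p
      p = obj%mass * v
   end function

   ! Hypothetical wrapper: total momentum over an array of velocities.
   ! Whether the compiler creates a temporary for the function result
   ! is left to its optimizer.
   function total_momentum(obj, v) result(tot_p)
      type(body), intent(in) :: obj
      real, intent(in) :: v(:)
      real :: tot_p
      tot_p = sum( calculateMomentum(obj, v) )
   end function
end module body_mod
```

Note that ball needs a mass before it is used, e.g. ball = body(2.0); the value 2.0 is only an example.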
Question
Consider the following code:
program example
implicit none
integer, parameter :: n_coeffs = 1000
integer, parameter :: n_indices = 5
integer :: i
real(8), dimension(n_coeffs) :: coeff
integer, dimension(n_coeffs,n_indices) :: index
do i = 1, n_coeffs
coeff(i) = real(i*3,8)
index(i,:) = [2,4,8,16,32]*i
end do
end
For any 5 dimensional index I need to obtain the associated coefficient, without knowing or calculating i. For instance, given [2,4,8,16,32] I need to obtain 3.0 without computing i.
Is there a reasonable solution, perhaps using sparse matrices, that would work for n_indices in the order of 100 (though n_coeffs still in the order of 1000)?
A Bad Solution
One solution would be to define a 5 dimensional array as in
real(8), dimension(2000,4000,8000,16000,32000) :: coeff2
do i = 1, n_coeffs
coeff2(index(i,1),index(i,2),index(i,3),index(i,4),index(i,5)) = coeff(i)
end do
then, to get the coefficient associated with [2,4,8,16,32], call
coeff2(2,4,8,16,32)
However, besides being very wasteful of memory, this solution would not allow n_indices to be set to a number higher than 7 given the limit of 7 dimensions to an array.
Note: This question is a spin-off of this one. Having failed in my first attempt, I have tried to ask the question more precisely, an effort that greatly benefited from the answer of @Rodrigo_Rodrigues.
Actual Code
In case it helps, here is the code for the actual problem I am trying to solve. It is an adaptive sparse grid method for approximating a function. The main goal is to make the interpolation at the end as fast as possible:
MODULE MOD_PARAMETERS
IMPLICIT NONE
SAVE
INTEGER, PARAMETER :: d = 2 ! number of dimensions
INTEGER, PARAMETER :: L_0 = 4 ! after this adaptive grid kicks in, for L <= L_0 usual sparse grid
INTEGER, PARAMETER :: L_max = 9 ! maximum level
INTEGER, PARAMETER :: bound = 0 ! 0 -> for f = 0 at boundary
! 1 -> adding grid points at boundary
! 2 -> extrapolating close to boundary
INTEGER, PARAMETER :: max_error = 1
INTEGER, PARAMETER :: L2_error = 1
INTEGER, PARAMETER :: testing_sample = 1000000
REAL(8), PARAMETER :: eps = 0.01D0 ! epsilon for adaptive grid
END MODULE MOD_PARAMETERS
PROGRAM MAIN
USE MOD_PARAMETERS
IMPLICIT NONE
INTEGER, DIMENSION(d,d) :: ident
REAL(8), DIMENSION(d) :: xd
INTEGER, DIMENSION(2*d) :: temp
INTEGER, DIMENSION(:,:), ALLOCATABLE :: grid_index, temp_grid_index, grid_index_new, J_index
REAL(8), DIMENSION(:), ALLOCATABLE :: coeff, temp_coeff, J_coeff
REAL(8) :: temp_min, temp_max, V, T, B, F, x1
INTEGER :: k, k_1, k_2, h, i, j, L, n, dd, L1, L2, dsize, count, first, repeated, add, ind
INTEGER :: time1, time2, clock_rate, clock_max
REAL(8), DIMENSION(L_max,L_max,2**(L_max),2**(L_max)) :: coeff_grid
INTEGER, DIMENSION(d) :: level, LL, ii
REAL(8), DIMENSION(testing_sample,d) :: x_rand
REAL(8), DIMENSION(testing_sample) :: interp1, interp2
! ============================================================================
! EXECUTABLE
! ============================================================================
ident = 0
DO i = 1,d
ident(i,i) = 1
ENDDO
! Initial grid point
dsize = 1
ALLOCATE(grid_index(dsize,2*d),grid_index_new(dsize,2*d))
grid_index(1,:) = 1
grid_index_new = grid_index
ALLOCATE(coeff(dsize))
xd = (/ 0.5D0, 0.5D0 /)
CALL FF(xd,coeff(1))
CALL FF(xd,coeff_grid(1,1,1,1))
L = 1
n = SIZE(grid_index_new,1)
ALLOCATE(J_index(n*2*d,2*d))
ALLOCATE(J_coeff(n*2*d))
CALL SYSTEM_CLOCK (time1,clock_rate,clock_max)
DO WHILE (L .LT. L_max)
L = L+1
n = SIZE(grid_index_new,1)
count = 0
first = 1
DEALLOCATE(J_index,J_coeff)
ALLOCATE(J_index(n*2*d,2*d))
ALLOCATE(J_coeff(n*2*d))
J_index = 0
J_coeff = 0.0D0
DO k = 1,n
DO i = 1,d
DO j = 1,2
IF ((bound .EQ. 0) .OR. (bound .EQ. 2)) THEN
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(grid_index_new(k,d+i)-(-1)**j)/)
ELSEIF (bound .EQ. 1) THEN
IF (grid_index_new(k,i) .EQ. 1) THEN
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(-(-1)**j)/)
ELSE
temp = grid_index_new(k,:)+(/ident(i,:),ident(i,:)*(grid_index_new(k,d+i)-(-1)**j)/)
ENDIF
ENDIF
CALL XX(d,temp(1:d),temp(d+1:2*d),xd)
temp_min = MINVAL(xd)
temp_max = MAXVAL(xd)
IF ((temp_min .GE. 0.0D0) .AND. (temp_max .LE. 1.0D0)) THEN
IF (first .EQ. 1) THEN
first = 0
count = count+1
J_index(count,:) = temp
V = 0.0D0
DO k_1 = 1,SIZE(grid_index,1)
T = 1.0D0
DO k_2 = 1,d
CALL XX(1,temp(k_2),temp(d+k_2),x1)
CALL BASE(x1,grid_index(k_1,k_2),grid_index(k_1,k_2+d),B)
T = T*B
ENDDO
V = V+coeff(k_1)*T
ENDDO
CALL FF(xd,F)
J_coeff(count) = F-V
ELSE
repeated = 0
DO h = 1,count
IF (SUM(ABS(J_index(h,:)-temp)) .EQ. 0) THEN
repeated = 1
ENDIF
ENDDO
IF (repeated .EQ. 0) THEN
count = count+1
J_index(count,:) = temp
V = 0.0D0
DO k_1 = 1,SIZE(grid_index,1)
T = 1.0D0
DO k_2 = 1,d
CALL XX(1,temp(k_2),temp(d+k_2),x1)
CALL BASE(x1,grid_index(k_1,k_2),grid_index(k_1,k_2+d),B)
T = T*B
ENDDO
V = V+coeff(k_1)*T
ENDDO
CALL FF(xd,F)
J_coeff(count) = F-V
ENDIF
ENDIF
ENDIF
ENDDO
ENDDO
ENDDO
ALLOCATE(temp_grid_index(dsize,2*d))
ALLOCATE(temp_coeff(dsize))
temp_grid_index = grid_index
temp_coeff = coeff
DEALLOCATE(grid_index,coeff)
ALLOCATE(grid_index(dsize+count,2*d))
ALLOCATE(coeff(dsize+count))
grid_index(1:dsize,:) = temp_grid_index
coeff(1:dsize) = temp_coeff
DEALLOCATE(temp_grid_index,temp_coeff)
grid_index(dsize+1:dsize+count,:) = J_index(1:count,:)
coeff(dsize+1:dsize+count) = J_coeff(1:count)
dsize = dsize + count
DO i = 1,count
coeff_grid(J_index(i,1),J_index(i,2),J_index(i,3),J_index(i,4)) = J_coeff(i)
ENDDO
IF (L .LE. L_0) THEN
DEALLOCATE(grid_index_new)
ALLOCATE(grid_index_new(count,2*d))
grid_index_new = J_index(1:count,:)
ELSE
add = 0
DO h = 1,count
IF (ABS(J_coeff(h)) .GT. eps) THEN
add = add + 1
J_index(add,:) = J_index(h,:)
ENDIF
ENDDO
DEALLOCATE(grid_index_new)
ALLOCATE(grid_index_new(add,2*d))
grid_index_new = J_index(1:add,:)
ENDIF
ENDDO
CALL SYSTEM_CLOCK (time2,clock_rate,clock_max)
PRINT *, 'Elapsed real time1 = ', DBLE(time2-time1)/DBLE(clock_rate)
PRINT *, 'Grid Points = ', SIZE(grid_index,1)
! ============================================================================
! Compute interpolated values:
! ============================================================================
CALL RANDOM_NUMBER(x_rand)
CALL SYSTEM_CLOCK (time1,clock_rate,clock_max)
DO i = 1,testing_sample
V = 0.0D0
DO L1=1,L_max
DO L2=1,L_max
IF (L1+L2 .LE. L_max+1) THEN
level = (/L1,L2/)
T = 1.0D0
DO dd = 1,d
T = T*(1.0D0-ABS(x_rand(i,dd)/2.0D0**(-DBLE(level(dd)))-DBLE(2*FLOOR(x_rand(i,dd)*2.0D0**DBLE(level(dd)-1))+1)))
ENDDO
V = V + coeff_grid(L1,L2,2*FLOOR(x_rand(i,1)*2.0D0**DBLE(L1-1))+1,2*FLOOR(x_rand(i,2)*2.0D0**DBLE(L2-1))+1)*T
ENDIF
ENDDO
ENDDO
interp2(i) = V
ENDDO
CALL SYSTEM_CLOCK (time2,clock_rate,clock_max)
PRINT *, 'Elapsed real time2 = ', DBLE(time2-time1)/DBLE(clock_rate)
END PROGRAM
For any 5 dimensional index I need to obtain the associated
coefficient, without knowing or calculating i. For instance, given
[2,4,8,16,32] I need to obtain 3.0 without computing i.
function findloc_vector(matrix, vector) result(out)
integer, intent(in) :: matrix(:, :)
integer, intent(in) :: vector(size(matrix, dim=2))
integer :: out, i
do i = 1, size(matrix, dim=1)
if (all(matrix(i, :) == vector)) then
out = i
return
end if
end do
stop "No match for this vector"
end
And that's how you use it:
print*, coeff(findloc_vector(index, [2,4,8,16,32])) ! outputs 3.0
I must confess I was reluctant to post this code because, even though it answers your question, I honestly think this is not what you really want or need, and you didn't provide enough information for me to know what that is.
Edit (After actual code from OP):
If I decrypted your code correctly (and considering what you said in your previous question), you are declaring:
REAL(8), DIMENSION(L_max,L_max,2**(L_max),2**(L_max)) :: coeff_grid
(where L_max = 9, so size(coeff_grid) = 21233664, roughly 160 MB) and then populating it with:
DO i = 1,count
coeff_grid(J_index(i,1),J_index(i,2),J_index(i,3),J_index(i,4)) = J_coeff(i)
ENDDO
(where count is of the order of 1000, i.e. 0.005% of its elements), so this way you can fetch the values by their 4 indices with the array notation.
Please, don't do that. You don't need a sparse matrix in this case either. The new approach you proposed is much better: storing the indices in each row of a smaller array, and fetching from the array of coefficients by the corresponding location of those indices in their own array. This is way faster (avoiding the large allocation) and much more memory-efficient.
PS: Is it mandatory for you to stick to Fortran 90? It's a very old version of the standard, and chances are that the compiler you're using implements a more recent version. You could improve the quality of your code a lot with the intrinsic move_alloc (for fewer array copies), the kind constants from the intrinsic module iso_fortran_env (for portability), the [], >, <, <=,... notation (for readability)...
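As an illustration of the move_alloc suggestion, here is a minimal sketch of growing an allocatable array with a single copy of the old data. The module name grow_mod and the subroutine name append_coeff are hypothetical, and the sketch assumes that coeff is already allocated on entry.

```fortran
module grow_mod
   use iso_fortran_env, only: dp => real64
   implicit none
contains
   ! Hypothetical helper: append the entries of extra to coeff.
   ! Assumes coeff is allocated on entry.
   subroutine append_coeff( coeff, extra )
      real(dp), allocatable, intent(inout) :: coeff(:)
      real(dp), intent(in) :: extra(:)
      real(dp), allocatable :: tmp(:)
      integer :: n_old
      n_old = size(coeff)
      allocate( tmp(n_old + size(extra)) )
      tmp(1:n_old) = coeff      ! the only copy of the old data
      tmp(n_old+1:) = extra
      ! coeff takes over the storage of tmp without copying;
      ! tmp is left deallocated
      call move_alloc( from=tmp, to=coeff )
   end subroutine append_coeff
end module grow_mod
```

Compared with the temp_coeff pattern in the posted code, the old values are copied once instead of twice, and no explicit DEALLOCATE is needed.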
My goal is to create 10,000 randomly generated numbers between 0 and 1, organize them into ten bins evenly spaced between 0 and 1, and compute a frequency for each bin. This is my code so far.
program listrand
implicit none
integer :: n,p
integer :: a,b,c,d,e,f,g,h,i,j = 0
real :: xran
!real, dimension(10,2) :: bin_and_freq -- list of bins and frequency
do n = 1,10000
call random_number(xran)
if (xran < 0.1) then
a = a + 1
elseif (xran>0.1 .and. xran<0.2) then
b = b + 1
elseif (xran>0.2 .and. xran<0.3) then
c = c+1
elseif (xran>0.3 .and. xran<0.4) then
d = d+1
elseif (xran>0.4 .and. xran<0.5) then
e = e + 1
elseif (xran>0.5 .and. xran<0.6) then
f = f+1
elseif (xran>0.6 .and. xran<0.7) then
g = g+1
elseif (xran>0.7 .and. xran<0.8) then
h=h+1
elseif (xran>0.8 .and. xran<0.9) then
i=i+1
else
j = j+1
endif
enddo
print *, a,b,c,d,e,f,g,h,i,j
end program listrand
I am getting an unexpected output:
988 1036 133225987 1004 934 986 1040 33770 1406729616 1052.
Why are c,h, and i so large? Also, is there a more efficient way of going about this than using the unwieldy IF/ELSEIF block I have?
In your declaration
integer :: a,b,c,d,e,f,g,h,i,j = 0
only j is initialised to 0; all the other variables are uninitialised and hold arbitrary values. If you add
a = 0
b = 0
c = 0
d = 0
e = 0
f = 0
g = 0
h = 0
i = 0
j = 0
before your loop, everything works well.
As for how to simplify it:
Here is my version of the program:
program listrand
implicit none
integer, parameter :: nbins = 10
integer :: n, bin
integer :: bin_hits(nbins) ! Number of bin hits
real :: xran
real :: bin_lower(nbins) ! Lower edge of bins
! bin_lower(1) == 0.0
bin_hits = 0
! Set up equidistant bins
bin_lower = [ (real(n-1) / nbins, n = 1, size(bin_lower)) ]
do n = 1,10000
call random_number(xran)
bin = count(bin_lower <= xran)
bin_hits(bin) = bin_hits(bin)+1
enddo
do n = 1, nbins-1
print '(2(F6.2), I6)', bin_lower(n), bin_lower(n+1), bin_hits(n)
end do
print '(2(F6.2), I6)', bin_lower(nbins), 1.0, bin_hits(nbins)
end program listrand
To find the index of the bin_hits element to increment, I'm counting the number of values in bin_lower that are lower than or equal to xran.
EDIT
I'd like to also point to the answer from High Performance Mark a bit further down, who instead of calling RANDOM_NUMBER for each value individually uses it to generate a whole array of random numbers.
Additionally, he's using the fact that the bins are fixed and equidistant to calculate the bin number directly from the random value instead of comparing it to each bin as in my version.
Both of these make the program faster.
If speed of execution is one's main concern, and if one is willing to trade space for time, this might appeal:
PROGRAM listrand
IMPLICIT NONE
INTEGER, PARAMETER :: nbins = 10
INTEGER, PARAMETER :: nsamples = 10**4
INTEGER :: bin_hits(0:nbins-1)
REAL :: xran(nsamples)
INTEGER :: binned_rn(nsamples), n
bin_hits = 0
CALL RANDOM_NUMBER(xran)
binned_rn = INT(nbins*xran)
DO n = 1, nsamples
bin_hits(binned_rn(n)) = bin_hits(binned_rn(n)) +1
END DO
WRITE(*,*) bin_hits
END PROGRAM listrand
In a limited number of tests this version is 3 - 4 times as fast as @chw21's version.
I'm creating a program that is required to read values from two arrays (ARR and MRK), counting each set of values (I,J) in order to determine their frequency for a third array (X). I've written the following so far, but nesting errors are preventing the program from compiling. Any help is greatly appreciated!
IMPLICIT NONE
REAL, DIMENSION (0:51, 0:51) :: MRK, ALT
INTEGER :: I, J !! FREQUENCY ARRAY ALLELES
INTEGER, PARAMETER :: K = 2
INTEGER :: M, N !! HAPLOTYPE ARRAY POSITIONS
INTEGER :: COUNTER = 0
REAL, DIMENSION(0:1,0:K-1):: X
ALT = 8
MRK = 8
X = 0
MRK(1:50,1:50) = 0 !! HAPLOTYPE ARRAY WITHOUT BUFFER AROUND OUTSIDE
ALT(1:50,1:50) = 0
DO I = 0, 1 !! ALTRUIST ALLELE
DO J = 0, K-1 !! MARKER ALLELE
DO M = 1, 50
DO N = 1, 50 !! READING HAPLOTYPE POSITIONS
IF ALT(M,N) = I .AND. MRK(M,N) = J THEN
COUNTER = COUNTER + 1
ELSE IF ALT(M,N) .NE. I .OR. MRK(M,N) .NE. J THEN
COUNTER = COUNTER + 0
END IF
X(I,J) = COUNTER/2500
COUNTER = 0
END DO
END DO
END DO
END DO
Your if syntax is incorrect. You should enclose the conditional expressions in parentheses. Also, you should replace the single = by a double == in those expressions, and it is probably best to stick to one style of operators, either == and /= or .eq. and .ne., rather than mixing them:
IF (ALT(M,N) == I .AND. MRK(M,N) == J) THEN
COUNTER = COUNTER + 1
ELSE IF (ALT(M,N) /= I .OR. MRK(M,N) /= J) THEN
COUNTER = COUNTER + 0
END IF
I don't know whether you do this in your actual program, but you should probably put program program_name at the very beginning of your code and end program program_name at the very end, where program_name is any name you choose for your program (it must be a valid Fortran name, so no spaces). A bare end at the end would also suffice.
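Beyond the syntax errors, two other things in the posted loop look unintended: X(I,J) is assigned and COUNTER is reset inside the innermost loop, and COUNTER/2500 is an integer division, which gives 0 whenever COUNTER is below 2500. As a possible alternative, the count intrinsic can replace the two inner loops altogether. This is only a sketch; the module name freq_mod and the function name allele_freq are made up for illustration.

```fortran
module freq_mod
   implicit none
contains
   ! Hypothetical helper: fraction of positions where alt equals i
   ! and mrk equals j at the same time.
   function allele_freq( alt, mrk, i, j ) result( freq )
      real, intent(in) :: alt(:,:), mrk(:,:)
      integer, intent(in) :: i, j
      real :: freq
      ! count returns an integer; convert before dividing to avoid
      ! integer division
      freq = real( count( alt == real(i) .and. mrk == real(j) ) ) / real( size(alt) )
   end function allele_freq
end module freq_mod
```

Inside the I and J loops, the body would then reduce to X(I,J) = allele_freq( ALT(1:50,1:50), MRK(1:50,1:50), I, J ), which leaves out the buffer cells around the outside.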