C[a,d,e] = A[a,b,c] * B[d,b,c,e]
The topic is relate to BLAS tensor contractions for two indexes together , but more complicated.
I can only use do loop to call gemm for many times. The data structure of C and B is not contiguous in current implementation.
I want to find a more efficient way to compute C[a,d,e] = A[a,b,c] * B[d,b,c,e]. Hopefully, call gemm only one time.
Program double_contractions_blas
Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
Implicit None
Real( wp ), Dimension( :, :, : ), Allocatable :: a
Real( wp ), Dimension( :, :, :, : ), Allocatable :: b
Real( wp ), Dimension( :, :, : ), Allocatable :: c
Real( wp ), Dimension( :, : ), Allocatable :: d
Real( wp ), Dimension( :, : ), Allocatable :: e
Integer :: na, nb, nc, nd, ne, nf, ng
Integer :: i
Integer( li ) :: start, finish, rate
Write( *, * ) 'na, nb, nc, nd, ne ?'
Read( *, * ) na, nb, nc, nd, ne
allocate( a( na, nb, nc ) )
allocate( b( nd, nb, nc, ne ) )
allocate( c( na, nd, ne ) )
nf = nd * ne
ng = nb * nc
Call Random_number( a )
Call Random_number( b )
Call System_clock( start, rate )
Do i = 1, nd
Call dgemm( 'N', 'N', na, ne, ng, 1.0_wp, a , Size( a , Dim = 1 ), &
b(i,:,:,:) , ng, &
0.0_wp, c(:,i,:), Size( c, Dim = 1 ) )
end do
Call System_clock( finish, rate )
write( *, * ) c
end program double_contractions_blas
Related
I tried to write the python code in Optimizing array additions and multiplications with transposes
in Fortran to see if I can achieve any speed up (-O3 helps a lot; the approach in Ian Bush's answer in Transposition of a matrix by multithread in Fortran, seems too complicated to me). E.g.,
0.1 * A(l1,l2,l3,l4) + 0.2*A(l1,l2,l4,l3) + 0.3 * A(l1,l3,l2,l4)+...
If I tried to extend from
Program transpose
integer, parameter :: dp = selected_real_kind(15, 307)
real(dp), dimension(:, :, :, :), allocatable :: a, b
Integer :: n1, n2, n3, n4, n, m_iter
Integer :: l1, l2, l3, l4
Integer(8) :: start, finish, rate
real(dp) :: sum_time
Write(*, *) 'n1, n2, n3, n4?'
Read(*, *) n1, n2, n3, n4
Allocate( a ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
Allocate( b ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
call random_init(.true., .false.)
Call Random_number( a )
m_iter = 100
b = 0.0_dp
Call system_clock( start, rate )
do n = 1, m_iter
do l4 = 1, n4
do l3 = 1, n3
do l2 = 1, n2
do l1 = 1, n1
b(l1,l2,l3,l4) = 0.1_dp*a(l1,l2,l3,l4) + 0.2_dp*a(l1,l2,l4,l3)
end do
end do
end do
end do
end do
Call system_clock( finish, rate )
sum_time = real( finish - start, dp ) / rate
write (*,*) 'all loop', sum_time/m_iter
print *, b(1,1,1,1)
End
(I tried reshape, slower than nested loops)
Is there any simple way to include A(l1,l3,l2,l4), A(l1,l3,l4,l2) etc? I can use Python to generate a strings to include all of them with \ for changing lines.
A potential complexity is, if there is a term 0.0 * A(l4,l3,l2,l1), and I would like to skip it, generate a string from python is complicated. Any more Fortran-like solution?
Another issue is, if the array A has different dimension in each index, say, n1 != n2 != n3 != n4, some permutation may out of bound. In this situation, the prefactor will be zero. For example, if n1 = n2 = 10, n3 = n4 = 20, it will be something like 0.1 * A(l1,l2,l3,l4) + 0.0 * A(l1,l3,l2,l4). In aother word, b = 0.1*a + 0.0*reshape(a, (/n1, n2, n3, n4/), order = (/1,3,2,4/) ) , or say 0.1*a + 0.0 * P(2,3) a, where P is a permutation operator. By checking the absolute value of permutation prefactor below some threshold, the summation would be able to skip that permutation.
In this case, the prefactor will be zero. The summation is supposed to skip that type of permutation.
Edited: a python reference implementation is below. I include a random and non-random version by the variable, gen_random. The latter may be eaiser to check.
import numpy as np
import time
import itertools as it
ref_list = [0, 1, 2, 3]
p = it.permutations(ref_list)
transpose_list = tuple(p)
n_loop = 2
na = nb = nc = nd = 30
A = np.zeros((na,nb,nc,nd))
gen_random = False
if gen_random == False:
n = 1
for la in range(na):
for lb in range(nb):
for lc in range(nc):
for ld in range(nd):
A[la,lb,lc,ld] = n
n = n + 1
else:
A = np.random.random((na,nb,nc,nd))
factor_list = [(i+1)*0.1 for i in range(24)]
time_total = 0.0
for n in range(n_loop):
sum_A = np.zeros((na,nb,nc,nd))
start_0 = time.time()
for m, t in enumerate(transpose_list):
sum_A = np.add(sum_A, factor_list[m] * np.transpose(A, transpose_list[m] ), out = sum_A)
#sum_A += factor_list[m] * np.transpose(A, transpose_list[m])
finish_0 = time.time()
time_total += finish_0 - start_0
print('level 4', time_total/n_loop)
print('Ref value', A[0,0,0,0], sum_A[0,0,0,0])
As a sanity check, if A[0,0,0,0] is non-zero, sum_A[0,0,0,0]/A[0,0,0,0] = 30, by 0.1 + 0.2 +... + 2.4 = (0.1+2.4)*2.4/2=30. Though the permutation factors can be different, the above is just an example.
Here's one way that I think does it in Fortran, which also skips terms for which the prefactor is zero. I make no claims as to it being the best, there are many ways to do it. I also am hesitant to claim its correctness, what you have provided it makes it difficult to assess this fully. But it does pass the sanity test for the case when all the sizes are the same ... You give no way to check the more general case.
The main problem is Fortran provides no way to create the permutations you require, so I've written a little module which I believe implements the same ordering as python. This is ordering is taken from the python documentation and the algorithm to implement it from wikipedia. Unit testing strongly suggests it does the job.
Once you have that it is easy to loop over each permutation in turn skipping those that have zero weight, either because the prefactor is zero, or the shapes are incompatible. So with all the caveats above here's my effort along with the compiler version and a few tests, some with array bounds checking on, some without.
Note even if this is correct I certainly make not claims as to how optimal it is - the memory access pattern is very non-trivial, and optimising that will require much more thought than I am willing to give this now, though I suspect cache blocking will be required, as in the matrix transposition question that you reference.
Module permutations_module
! Little module to handle permutations of an arbitrary sized list of integer 1, 2, 3, .... n
Implicit None
Type, Public :: permutation
Private
Integer, Dimension( : ), Allocatable, Private :: state
Contains
Procedure, Public :: init
Procedure, Public :: get
Procedure, Public :: next
End type permutation
Private
Contains
Subroutine init( p, n )
! Initalise a permutation
Class( permutation ), Intent( Out ) :: p
Integer , Intent( In ) :: n
Integer :: i
Allocate( p%state( 1:n ) )
p%state = [ ( i, i = 1, Size( p%state ) ) ]
End Subroutine init
Pure Function get( p ) Result( a )
! Get the current permutation
Class( permutation ), Intent( In ) :: p
Integer, Dimension( : ), Allocatable :: a
a = p%state
End Function get
Function next( p ) Result( finished )
! Move onto the next permutation, returning .True. if there are no more permutations in the list
Logical :: finished
Class( permutation ), Intent( InOut ) :: p
Integer :: k, l
Integer :: tmp
finished = .False.
Do k = Size( p%state ) - 1, 1, -1
If( p%state( k ) < p%state( k + 1 ) ) Exit
End Do
finished = k == 0
If( .Not. finished ) Then
Do l = Size( p%state ), k + 1, -1
If( p%state( k ) < p%state( l ) ) Exit
End Do
tmp = p%state( k )
p%state( k ) = p%state( l )
p%state( l ) = tmp
p%state( k + 1: ) = p%state( Size( p%state ):k + 1: - 1 )
End If
End Function next
End Module permutations_module
Program testit
Use, Intrinsic :: iso_fortran_env, Only : wp => real64
Use permutations_module, Only : permutation
Implicit None
Integer, Parameter :: n_iter = 100
Type( permutation ) :: p
Integer :: i
Real( wp ), Dimension( :, :, :, : ), Allocatable :: a
Real( wp ), Dimension( :, :, :, : ), Allocatable :: b
Real( wp ), Dimension( 1:Product( [ ( i, i = 1, Size( Shape( a ) ) ) ] ) ) :: c
Integer, Dimension( 1:Size( Shape( a ) ) ) :: this_permutation
Integer, Dimension( 1:Size( Shape( a ) ) ) :: sizes
Integer, Dimension( 1:Size( Shape( a ) ) ) :: permuted_sizes
Integer, Dimension( 1:Size( Shape( a ) ) ) :: indices
Integer, Dimension( 1:Size( Shape( a ) ) ) :: permuted_indices
Integer :: n1, n2, n3, n4
Integer :: l1, l2, l3, l4
Integer :: iter
Integer :: start, finish, rate
Logical :: finished
c = [ ( i * 0.1_wp, i = 1, Size( c ) ) ]
Write( *, * ) 'n1, n2, n3, n4?'
Read ( *, * ) n1, n2, n3, n4
Allocate( a ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
Allocate( b, Mold = a )
Call Random_init( .true., .false. )
Call Random_number( a )
! Make sure a( 1, 1, 1, 1 ) is not zero for the sanity check
a( 1, 1, 1, 1 ) = a( 1, 1, 1, 1 ) + 0.1_wp
Call system_clock( start, rate )
sizes = Shape( a )
b = 0.0_wp
iter_loop: Do iter = 1, n_iter
Call p%init( Size( Shape( a ) ) )
i = 0
finished = .False.
permutation_loop: Do While( .Not. finished )
i = i + 1
! Get the next permutation
finished = p%next()
! Only do it if it has any weight
If( Abs( c( i ) ) > Epsilon( c( i ) ) ) Then
! Get the current permutation
this_permutation = p%get()
! Check the shapes are compatible
permuted_sizes = sizes( this_permutation )
If( All( permuted_sizes == sizes ) ) Then
! Add in the current permutation
Do l4 = 1, n4
Do l3 = 1, n3
Do l2 = 1, n2
Do l1 = 1, n1
indices = [ l1, l2, l3, l4 ]
permuted_indices = indices( this_permutation )
b( indices( 1 ), indices( 2 ), indices( 3 ), indices( 4 ) ) = &
b(indices( 1 ), indices( 2 ), indices( 3 ), indices( 4 ) ) + &
c( i ) * a( permuted_indices( 1 ), permuted_indices( 2 ), &
permuted_indices( 3 ), permuted_indices( 4 ) )
End Do
End Do
End Do
End Do
End If
End If
End Do permutation_loop
End Do iter_loop
Call system_clock( finish, rate )
Write( *, * ) 'time per iteration = ', Real( finish - start ) / Real( rate ) / Real( n_iter )
Write( *, * ) 'Sanity ', b( 1, 1, 1, 1 ) / a( 1, 1, 1, 1 ) / n_iter
End Program testit
ijb#ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb#ijb-Latitude-5410:~/work/stack$ gfortran -fcheck=all -Wall -Wextra -O3 -g -std=f2018 perm_mod_single_thread.f90
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
n1, n2, n3, n4?
10 10 10 10
time per iteration = 1.47000002E-03
Sanity 30.000000000000259
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
n1, n2, n3, n4?
11 10 9 12
time per iteration = 0.00000000
Sanity 0.0000000000000000
ijb#ijb-Latitude-5410:~/work/stack$ gfortran -Wall -Wextra -O3 -g -std=f2018 perm_mod_single_thread.f90
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
n1, n2, n3, n4?
30 30 30 30
time per iteration = 6.56599998E-02
Sanity 29.999999999999844
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
n1, n2, n3, n4?
60 60 60 60
time per iteration = 2.46800995
Sanity 30.000000000000036
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
n1, n2, n3, n4?
10 20 30 40
time per iteration = 2.00000013E-05
Sanity 0.0000000000000000
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
n1, n2, n3, n4?
30 30 15 15
time per iteration = 1.50999997E-03
Sanity 1.4000000000000050
ijb#ijb-Latitude-5410:~/work/stack$
The background of my question is related to Optimizing array additions and multiplications with transposes
I am thinking about optimizng 0.1*A + 0.1*transpose(A,(1,0)) (possibly with more general transpose) by Fortran pointer, where A is an array. (transpose in python sense, seems related to reshape in Fortran)
I am not sure if tranposing/multiplying value via pointer will be faster than using array. I thought using pointer may restricted operation within given memory locations. If I use
b = 0.1*a + 0.1*reshape(a, (/n1, n2, n3, n4/), order = (/2,1,3,4/) )
, reshape may be associated to different memory location.
Here is my code
Program transpose_test
use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
integer, parameter :: dp = selected_real_kind(15, 307)
! Implicit None
real(dp), Dimension( :, :, :, : ), Allocatable :: a, b
integer :: n1, n2, n3, n4, i, m, n, m_iter
integer :: l1, l2, l3, l4
integer(li) :: start, finish, rate
real(dp) :: sum_time
real(dp), target, allocatable :: at(:,:,:,:)
real(dp), pointer :: ap(:,:,:,:), bp(:,:,:,:)
Write( *, * ) 'n1, n2, n3, n4?'
Read( *, * ) n1, n2, n3, n4
Allocate( a ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
i = 0
do l1 = 1, n1
do l2 = 1, n2
do l3 = 1, n3
do l4 = 1, n4
a(l1, l2, l3, l4) = i
i = i + 1
end do
end do
end do
end do
at = a
ap => at
bp => ap
!print *, at
print *, 'ap', ap
print *, 'bp', bp
sum_time = 0.0
do n = 1, m_iter
Call System_clock( start, rate )
do l2 = 1, n2
do l1 = 1, n1
bp(l1,l2,:,:) => 0.1*ap(l1,l2,:,:) + 0.1*ap(l2,l1,:,:)
end do
end do
Call System_clock( finish, rate )
sum_time = sum_time + Real( finish - start, dp ) / rate
end do
write (*,*) 'reshape pointer time', sum_time
print *, 'bp', bp
End
gfotran 9.3.0 gives
56 | bp(l1,l2,:,:) => 0.1*ap(l1,l2,:,:) + 0.1*ap(l2,l1,:,:)
| 1
Error: Expected list of ‘lower-bound :’ or list of ‘lower-bound : upper-bound’ specifications at (1)
What would be the solution for the above error message? Generally, will the above approach out perform tranposing array, e.g.,
b = 0.1*a + 0.1*reshape(a, (/n1, n2, n3, n4/), order = (/2,1,3,4/) )
and comparing with numpy realization related to the question in the first paragraph.
?
I am not experienced programer in fortran so I need a help about my simple code.
My code is:
module derived_type
implicit none
type :: iter_type
integer :: calc_tmp
integer :: n_iter
contains
procedure :: calc_iter => calc_iter_process
procedure :: take_calc_tmp => take_data_calc_tmp
procedure :: take_n_iter => take_data_n_iter
end type iter_type
private :: calc_iter_process
private :: take_data_calc_tmp
private :: take_data_n_iter
contains
function calc_iter_process( this, indx_00 ) result( err_tmp )
class( iter_type ) :: this
integer, intent( in ) :: indx_00
logical :: err_tmp
err_tmp = .false.
this%n_iter = 0
this%calc_tmp = 1
do while( this%calc_tmp < indx_00 )
this%n_iter = this%n_iter + 1
if ( this%n_iter > 50 ) then
write(*,*) "error - maximal number of iterations !!!"
err_tmp = .true.
exit
end if
this%calc_tmp = this%calc_tmp + 1
end do
end function calc_iter_process
function take_data_calc_tmp( this ) result( data_tmp )
class( iter_type ) :: this
integer :: data_tmp
data_tmp = this%calc_tmp
end function take_data_calc_tmp
function take_data_n_iter( this ) result( data_tmp )
class( iter_type ) :: this
integer :: data_tmp
data_tmp = this%n_iter
end function take_data_n_iter
end module derived_type
program iteration_values
use, non_intrinsic :: derived_type
implicit none
integer, parameter :: number_00 = 32
logical :: global_err
type( iter_type ) :: iter_object
global_err = iter_object%calc_iter( number_00 )
if ( global_err ) stop "error - global !!!"
end program iteration_values
I need to find way for code modification which can give me a way to keep or save value of 'calc_tmp' in every single iterations.
When I thinking about that I can not imagine how to allocate or deallocate some array which must be dimension same or higher the 'n_iter'.
Is there way for doing that?
I would recommend the use of the allocatable attribute and move_alloc. Here is an example program. move_alloc is Fortran 2003. In this example, I'm increasing the size of the array every time its size is exceeded.
program temp
implicit none
integer, dimension(:), allocatable :: tempval, calc_tmp_history
integer :: i, j, calc_tmp, totalSize
totalSize = 0
allocate(calc_tmp_history(2))
do i = 1,4
calc_tmp = 2*i
if (i > size(calc_tmp_history)) then
call move_alloc(calc_tmp_history,tempval)
allocate(calc_tmp_history(2*i))
do j = 1,i
calc_tmp_history(j) = tempval(j)
end do
end if
calc_tmp_history(i) = calc_tmp
totalSize = totalSize + 1
end do
do i = 1,totalSize
print *, calc_tmp_history(i)
end do
end program
Output from this is:
2
4
6
8
I am new to Fortran so I would like to have some insight regarding the allocation of dynamic memory
I read about dynamic memory allocation and various sources have a different take to this subject. For example, one book states that every single block of allocated dynamic memory must be deallocated at the end of the program to avoid memory leaks. However, other sources (books and various web pages) claim that is invalid as compilers (gfortran and alike) deallocate
all dynamic objects, arrays, etc automatically at the end of the program.
So in my sample code, I do not know if there is a need to deallocate dynamic array NN_VOD from CALCULATE_DATA_DM procedure.
What do I need to do with this sample code if I want to avoid memory leak and are there any memory leak in this code? (My IDE is Code::Blocks 17.12 with MinGW compiler 6.3.0)
MODULE DERRIVED_TYPE_TMP
INTEGER, PUBLIC :: I, J, K, ALLOC_ERR
TYPE, PUBLIC :: DM_ELEMENT
CHARACTER( 50 ), PRIVATE :: ELE_NAME
INTEGER, PRIVATE :: ELE_NUMBER
CONTAINS
PROCEDURE, PUBLIC :: CALCULATE_ELEMENT => CALCULATE_DATA_ELEMENT
END TYPE DM_ELEMENT
PRIVATE :: CALCULATE_DATA_ELEMENT
TYPE, EXTENDS(DM_ELEMENT), PUBLIC :: VOD_DM
INTEGER, ALLOCATABLE, PRIVATE :: NN_VOD( : )
CONTAINS
PROCEDURE, PUBLIC :: CALCULATE_ELEMENT => CALCULATE_DATA_DM
PROCEDURE, PUBLIC :: TAKE_DM => TAKE_DATA_DM
END TYPE VOD_DM
PRIVATE :: CALCULATE_DATA_DM
PRIVATE :: TAKE_DATA_DM
CONTAINS
SUBROUTINE CALCULATE_DATA_ELEMENT ( THIS, NUMBER_TMP )
CLASS( DM_ELEMENT ) :: THIS
INTEGER, INTENT( IN ) :: NUMBER_TMP
END SUBROUTINE CALCULATE_DATA_ELEMENT
SUBROUTINE CALCULATE_DATA_DM( THIS, NUMBER_TMP )
CLASS( VOD_DM ) :: THIS
INTEGER, INTENT( IN ) :: NUMBER_TMP
IF ( .NOT. ALLOCATED( THIS%NN_VOD ) ) ALLOCATE( THIS%NN_VOD( NUMBER_TMP ), STAT = ALLOC_ERR )
IF ( ALLOC_ERR .NE. 0 ) STOP ( "PROBLEM SA ALOKACIJOM MEMORIJE - THIS%T !!!" )
DO J = 1, NUMBER_TMP
THIS%NN_VOD( J ) = J + NUMBER_TMP
END DO
END SUBROUTINE CALCULATE_DATA_DM
FUNCTION TAKE_DATA_DM( THIS, INDX ) RESULT( RESULT_TMP )
CLASS( VOD_DM ) :: THIS
INTEGER, INTENT( IN ) :: INDX
INTEGER :: RESULT_TMP
RESULT_TMP = THIS%NN_VOD( INDX )
END FUNCTION TAKE_DATA_DM
END MODULE DERRIVED_TYPE_TMP
PROGRAM DO_LOOP_ALLOCATION
USE, NON_INTRINSIC :: DERRIVED_TYPE_TMP
IMPLICIT NONE
INTEGER, PARAMETER :: N_NN_DM = 3
INTEGER, PARAMETER :: AN_NN_DM( N_NN_DM ) = [ 2, 3, 4 ]
TYPE :: NN_VOD
TYPE( VOD_DM ), ALLOCATABLE :: ID( : )
END TYPE NN_VOD
CLASS( DM_ELEMENT ), POINTER :: P_DM_ELEMENT
TYPE ( NN_VOD ), ALLOCATABLE, TARGET :: PAR_NN_VOD( : )
IF ( .NOT. ALLOCATED( PAR_NN_VOD ) ) ALLOCATE( PAR_NN_VOD( N_NN_DM ), STAT = ALLOC_ERR )
IF ( ALLOC_ERR .NE. 0 ) STOP ( "ALLOCATION ERROR - PAR_NN_VOD !!!" )
DO K = 1, N_NN_DM
IF ( .NOT. ALLOCATED( PAR_NN_VOD( K )%ID ) ) ALLOCATE( PAR_NN_VOD( K )%ID( AN_NN_DM( K ) ), STAT = ALLOC_ERR )
IF ( ALLOC_ERR .NE. 0 ) STOP ( "ALLOCATION ERROR - PAR_NN_VOD%ID !!!")
END DO
DO K = 1, N_NN_DM
DO I = 1, AN_NN_DM( K )
P_DM_ELEMENT => PAR_NN_VOD( K )%ID( I )
CALL P_DM_ELEMENT%CALCULATE_ELEMENT( K + I )
END DO
END DO
END PROGRAM DO_LOOP_ALLOCATION
From Fortran95 onwards the language is designed so that with a standard conforming compiler it is impossible to have a memory leak when using allocatable arrays, as once an allocatable object goes out of scope it becomes deallocated. This is one of the big advantages of allocatable arrays, and one of the reasons why they should always be used in preference to pointers where possible. Now when a variable goes out of scope may well be long after a variable stops being used, and so you may wish to manually deallocate earlier to save memory, but there is no need to deallocate purely to avoid a memory leak. Thus in your code use allocatable arrays and there will be no memory leak.
In Fortran 90 this was not true, memory leaks with allocatables were possible. But this standard has long been superseded by Fortran 95 and it, and thus Fortran 90 and all earlier standards should not be being used today.
I have to calculate the moving average of a masked dataset with dimensions (7320,8520) in Fortran. I wrote a subroutine that receives the data (TS) and outputs the averaged data (TS_NEW). The problem is that the code is taking too long to run (it actually never finishes, despite not running into memory issues). I wonder if there's a way to make the code more efficient. Below is the code I wrote:
SUBROUTINE avgwin(ts,winsize,size1,size2,sizelat,sizelon,ts_new)
implicit none
double precision, dimension(size1,size2),INTENT(IN) :: ts
double precision, dimension(winsize,winsize) :: store
double precision, dimension(sizelon,sizelat),INTENT(OUT) :: ts_new
integer :: j,k
integer :: A, B
integer,INTENT(IN) :: winsize,size1,size2,sizelat,sizelon
logical, dimension(size1,size2) :: mask,mask2
double precision :: SUMVAR, COUNTVAR
A=1
B=1
mask = ts > 0 !Mask to highlight all the OK values
mask2 = ts < 0 !Mask to highlight all the values to be discarded
do j=1,sizelat !Looping through latitude
do k=1,sizelon !Looping through longitude
if (ALL(mask2(k:k+winsize-1,j:j+winsize-1)) .eqv. .true.) then
ts_new(B,A) = -100 !Adds a fill value if all the elements are to be discarded
B=B+1
else
SUMVAR = sum(ts(k:k+winsize-1,j:j+winsize-1), MASK=mask(k:k+winsize-1,j:j+winsize-1))
COUNTVAR = count(mask(k:k+winsize-1,j:j+winsize-1))
ts_new(B,A) = SUMVAR/COUNTVAR
B=B+1
end if
end do
B=1
A=A+1
end do
END SUBROUTINE
program test
implicit none
double precision, dimension(7320,8520) :: DATA
double precision, dimension(:,:),allocatable :: DATA_NEW
integer :: sizelat, sizelon, i, j, len1, len2, winsize
integer, dimension(3) :: sizes
len1 = 7320
len2 = 8520
do i=1,8520
do j=1,7320
DATA(j,i)= i !Just for testing purposes
end do
end do
sizes(1:3) = (/300,301,302/)
do w=1,3
winsize = sizes(w)
sizelon = len1-winsize+1
sizelat = len2-winsize+1
allocate(DATA_NEW(sizelon,sizelat))
CALL avgwin(DATA,winsize,len1,len2,sizelat,sizelon,DATANEW)
end do
end program test
Though not sure if this meets the OP's purpose, how about first collecting data along one dimension and then collecting the processed data again along another dimension (i.e., partial summation)? For example, if we consider a simpler problem of summing data( 1:L, 1:L ) over moving window of size w, there may be three different ways to achieve this:
program main
implicit none
real, allocatable, dimension(:,:) :: data, direct, part1, part2
integer :: i1, i2, L, S, w
real :: t1, t2
L = 2000
w = 50
S = L - w + 1
allocate( data( L, L ), direct( S, S ), &
part1( L, S ), part2( S, S ) )
!> test data
do i2 = 1, L
do i1 = 1, L
data( i1, i2 ) = mod( i1 + i2, 2 )
enddo
enddo
!> method 1: direct sum (cost = O( S^2 * w^2 ))
call cpu_time( t1 )
do i2 = 1, S
do i1 = 1, S
direct( i1, i2 ) = sum( data( i1:(i1 + w - 1), i2:(i2 + w - 1) ) )
enddo
enddo
call cpu_time( t2 )
print *, "time (s) = ", t2 - t1
!> method 2: partial sum (cost = O( S^2 * w * 2 ))
call cpu_time( t1 )
do i2 = 1, S
do i1 = 1, L
part1( i1, i2 ) = sum( data( i1, i2:(i2 + w - 1) ) )
enddo
enddo
do i2 = 1, S
do i1 = 1, S
part2( i1, i2 ) = sum( part1( i1:(i1 + w - 1), i2 ) )
enddo
enddo
call cpu_time( t2 )
print *, "time (s) = ", t2 - t1
print *, "error = ", maxval( abs( part2 - direct ) )
!> method 3: an improved version of method 2 (cost = O( S^2 ))
call cpu_time( t1 )
do i1 = 1, L
part1( i1, 1 ) = sum( data( i1, 1:w ) )
do i2 = 2, S
part1( i1, i2 ) = part1( i1, i2-1 ) &
- data( i1, i2-1 ) + data( i1, i2+w-1 )
enddo
enddo
do i2 = 1, S
part2( 1, i2 ) = sum( part1( 1:w, i2 ) )
do i1 = 2, S
part2( i1, i2 ) = part2( i1-1, i2 ) &
- part1( i1-1, i2 ) + part1( i1+w-1, i2 )
enddo
enddo
call cpu_time( t2 )
print *, "time (s) = ", t2 - t1
print *, "error = ", maxval( abs( part2 - direct ) )
end program
Then, gfortran-7.2 -O3 test.f90 seems to give some nice speedup:
time (s) = 9.64789867
time (s) = 0.345023155
error = 0.00000000
time (s) = 8.60958099E-02
error = 0.00000000
To calculate moving average with mask, a similar approach may work somehow. If we search the net, there may be other (better) approaches/libraries for such moving average, because it is very common calculation...