Related
A small example serial code, which has the same structure as my code, is shown below.
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: i, j
DOUBLE PRECISION :: en,ei,es
DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(i,j) = DBLE(i) + DBLE(j)
END DO
END DO
DO i = 1, 200, 1
en = 2.0d0/DBLE(200)*(i-1)-1.0d0
et(i) = en
es = 0.0d0
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
WRITE (UNIT=3, FMT=*) et(i), es
END DO
CLOSE(UNIT=3)
STOP
END PROGRAM MAIN
SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER :: i
DOUBLE PRECISION :: en, ei, gf,p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
gf = 1.0d0 / (en - kn(i) * p)
ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
I am running my code on the cluster, which has 32 CPUs on one node, and there are totally 250 GB memory shared by 32 CPUs on one node. I can use 32 nodes maximumly.
Every time when the inner Loop is done, there is one data to be collected. After all outer Loops are done, there are totally 200 data to be collected. If only the inner Loop is executed by one CPU, it would take more than 3 days (more than 72 hours).
I want to do the parallelization for both inner Loop and outer Loop respectively? Would anyone please suggest how to parallelize this code?
Can I use MPI technique for both inner Loop and outer Loop respectively? If so, how to differentiate different CPUs that execute different Loops (inner Loop and outer Loop)?
On the other hand, I saw someone mention the parallelization with hybrid MPI and OpenMP method. Can I use MPI technique for the outer Loop and OpenMP technique for the inner Loop? If so, how to collect one data to the CPU after every inner Loop is done each time and collect 200 data in total to CPU after all outer Loops are done. How to differentiate different CPUs that execute inner Loop and outer Loop respectively?
Alternatively, would anyone provide any other suggestion on parallelizing the code and enhance the efficiency? Thank you very much in advance.
As mentioned in the comments, a good answer will require more detailed question. However, at a first sight it seems that parallelizing the internal loop
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
should be enough to solve your problem, or at least it will be a good starter. First of all I guess that there is an error on the loop
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(j,k) = DBLE(j) + DBLE(k)
END DO
END Do
since the k is set to 0 and and there is no cell with address corresponding to 0 (see your variable declaration). Also ki is declared ki(1000,2000) array while ki(j,i) is (2000,1000) array. Beside these error, I guess that ki should be calculated as
ki(i,j) = DBLE(j) + DBLE(i)
if true, I suggest you the following solution
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: i, j, k,icr,icr0,icr1
DOUBLE PRECISION :: en,ei,es,timerRate
DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
INTEGER,PARAMETER:: nthreads=1
call system_clock(count_rate=icr)
timerRate=real(icr)
call system_clock(icr0)
call omp_set_num_threads(nthreads)
OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(i,j) = DBLE(j) + DBLE(i)
END DO
END DO
DO i = 1, 200, 1
en = 2.0d0/DBLE(200)*(i-1)-1.0d0
et(i) = en
es = 0.0d0
!$OMP PARALLEL DO private(j,kn,ei) firstpribate(en) shared(ki) reduction(+:es)
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
!$OMP END PARALLEL DO
WRITE (UNIT=3, FMT=*) et(i), es
END DO
CLOSE(UNIT=3)
call system_clock(icr1)
write (*,*) (icr1-icr0)/timerRate ! return computing time
STOP
END PROGRAM MAIN
SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER :: i
DOUBLE PRECISION :: en, ei, gf,p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
gf = 1.0d0 / (en - kn(i) * p)
ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
I add some variables to check the computing time ;-).
This solution is computed in 5.14 s, for nthreads=1, and in 2.75 s, for nthreads=2. It does not divide the computing time by 2, but it seems to be a good deal for a first shot. Unfortunately, on this machine I have a core i3 proc. So I can't do better than nthreads=2. However, I wonder, how the code will behave with nthreads=16 ???
Please let me know
I hope that this helps you.
Finally, I warn about the choice of variables status (private, firstprivate and shared) that might be consider carefully in the real code.
I am currently working on adding openmp parallelization for a do loop in one of the codes I have written for research. I am fairly new to using openmp so I would appreciate if you had any suggestions for what might be going wrong.
Basically, I have added a parallel do loop to the following code (which works prior to parallelization). r(:,:,:,:) is a vector of a ton of molecular coordinates indexed by time, molecule, atom, and (xyz). This vector is about 100 gb of data (I am working on an HPC with plenty of RAM). I am trying to parallelize the outer loop and subdivide it between processors so that I can reduce the amount of time this calculation goes. I thought it would be a good one to do it with as msd and cm_msd are the only things that would need to be edited by multiple processors and stored for later, which since each iteration gets its own element of these arrays they won't have a race condition.
The problem: If I run this code 5 times I get varying results, sometimes msd is calculated correctly (or appears to be), and sometimes it outputs all zeros later when I average it together. Without parallelization there are no issues.
I have been trying altering the shared vs private variables in the code and I think I have accounted for everything. The i index of the msd array and msd_cm array should never be equivalent between threads so I would think that they wouldn't be an issue.
! Loop over time origins
counti = 0
ind = 0
!$OMP PARALLEL DO schedule(static) PRIVATE(i,j,k,it,r_old,r_cm_old,shift,shift_cm,dsq,ind) &
!$OMP& SHARED(msd,msd_cm)
do i=1, nconfigs-nt, or_int
if(MOD(counti*or_int,500) == 0) then
write(*,*) 'Reached the ', counti*or_int,'th time origin'
end if
! Set the Old Coordinates
counti = counti + 1
ind = (i-1)/or_int + 1
r_old(:,:,:) = r(i,:,:,:)
r_cm_old(:,:) = r_cm(i,:,:)
shift = 0.0
shift_cm = 0.0
! Loop over the timesteps in each trajectory
do it=i+2, nt+i
! Loop over molecules
do j = 1, nmols
do k=1, atms_per_mol
! Calculate the shift if it occurs.
shift(j,k,:) = shift(j,k,:) - L(:)*anint((r(it,j,k,:) - &
r_old(j,k,:) )/L(:))
! Calculate the square displacements
dsq = ( r(it,j,k,1) + shift(j,k,1) - r(i,j,k,1) ) ** 2. &
+( r(it,j,k,2) + shift(j,k,2) - r(i,j,k,2) ) ** 2. &
+( r(it,j,k,3) + shift(j,k,3) - r(i,j,k,3) ) ** 2.
msd(ind, it-1-i, k) = msd(ind, it-1-i, k) + dsq
! Calculate the contribution to the c1,c2
enddo ! End Atoms Loop (k)
! Calculate the shift if it occurs.
shift_cm(j,:) = shift_cm(j,:) - L(:)*anint((r_cm(it,j,:) - &
r_cm_old(j,:) )/L(:))
! Calculate the square displacements
dsq = ( r_cm(it,j,1) + shift_cm(j,1) - r_cm(i,j,1) ) ** 2. &
+( r_cm(it,j,2) + shift_cm(j,2) - r_cm(i,j,2) ) ** 2. &
+( r_cm(it,j,3) + shift_cm(j,3) - r_cm(i,j,3) ) ** 2.
msd_cm(ind,it-1-i) = msd_cm(ind, it-1-i) + dsq
enddo ! End Molecules Loop (j)
r_old(:,:,:) = r(it,:,:,:)
r_cm_old(:,:) = r_cm(it,:,:)
enddo ! End t's loop (it)
enddo
!$OMP END PARALLEL DO
When this code is run, when I later print the averaged msd results out they either come out as correctly or they come out as zero and it is always one or the other. Do you see an issue that might explain why it is only working part of the time. I am brand new to openmp so it is completely possible there is just something incredibly stupid with how I am trying to do this.
Thanks in advance!
I am trying to calculate something similar to a weighted matrix inner product in Fortran. The current script that I am using for calculating the inner product is as follows
! --> In
real(kind=8), intent(in), dimension(ni, nj, nk, nVar) :: U1, U2
real(kind=8), intent(in), dimension(ni, nj, nk) :: intW
! --> Out
real(kind=8), intent(out) :: innerProd
! --> Local
integer :: ni, nj, nk, nVar, iVar
! --> Computing inner product
do iVar = 1, nVar
innerProd = innerProd + sum(U1(:,:,:,iVar)*U2(:,:,:,iVar)*intW)
enddo
But I found that the above script that I am currently using is not very efficient. The same operation can be performed in Python using NumPy as follows,
import numpy as np
import os
# --> Preventing numpy from multi-threading
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
innerProd = 0
# --> Toy matrices
U1 = np.random.random((ni,nj,nk,nVar))
U2 = np.random.random((ni,nj,nk,nVar))
intW = np.random.random((ni,nj,nk))
# --> Reshaping
U1 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
U2 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
intW = np.reshape(np.ravel(intW), (ni*nj*nk))
# --> Calculating inner product
for iVar in range(nVar):
innerProd = innerProd + np.dot(U1[:, iVar], U2[:, iVar]*intW)
The second method using Numpy seems to be far more faster than the method using Fortran. For a specific case of ni = nj = nk = nVar = 130, the time taken by the two methods are as follows
fortran_time = 25.8641 s
numpy_time = 6.8924 s
I tried improving my Fortran code with ddot from BLAS as follows,
do iVar = 1, nVar
do k = 1, nk
do j = 1, nj
innerProd = innerProd + ddot(ni, U1(:,j,k,iVar), 1, U2(:,j,k,iVar)*intW(:,j,k), 1)
enddo
enddo
enddo
But there was no considerable improvement in time. The time taken by the above method for the case of ni = nj = nk = nVar = 130 is ~24s. (I forgot to mention that I compiled the Fortran code with '-O2' option for optimizing the performance).
Unfortunately, there is no BLAS function for element-wise matrix multiplication in Fortran. And I don't want to use reshape in Fortran because unlike python reshaping in Fortran will lead to copying my array to a new array leading to more RAM usage.
Is there any way to speed up the performance in Fortran so as to get close to the performance of Numpy?
You may not be timing what you think are timing. Here's a complete fortran example
program test
use iso_fortran_env, r8 => real64
implicit none
integer, parameter :: ni = 130, nj = 130, nk = 130, nvar = 130
real(r8), allocatable :: u1(:,:,:,:), u2(:,:,:,:), w(:,:,:)
real(r8) :: sum, t0, t1
integer :: i,j,k,n
call cpu_time(t0)
allocate(u1(ni,nj,nk,nvar))
allocate(u2(ni,nj,nk,nvar))
allocate(w(ni,nj,nk))
call cpu_time(t1)
write(*,'("allocation time(s):",es15.5)') t1-t0
call cpu_time(t0)
call random_seed()
call random_number(u1)
call random_number(u2)
call random_number(w)
call cpu_time(t1)
write(*,'("random init time (s):",es15.5)') t1-t0
sum = 0.0_r8
call cpu_time(t0)
do n = 1, nvar
do k = 1, nk
do j = 1, nj
do i = 1, ni
sum = sum + u1(i,j,k,n)*u2(i,j,k,n)*w(i,j,k)
end do
end do
end do
end do
call cpu_time(t1)
write(*,'("Sum:",es15.5," time(s):",es15.5)') sum, t1-t0
end program
And the output:
$ gfortran -O2 -o inner_product inner_product.f90
$ time ./inner_product
allocation time(s): 3.00000E-05
random init time (s): 5.73293E+00
Sum: 3.57050E+07 time(s): 5.69066E-01
real 0m6.465s
user 0m4.634s
sys 0m1.798s
Computing the inner product is less that 10% of the runtime in this fortran code. How/What you are timing is very important. Are you sure you are timing the same things in the fortran and python versions? Are you sure you are only timing the inner_product calculation?
This avoids making any copy. (note the blas ddot approach still needs to make a copy for the element-wise product)
subroutine dot3(n,a,b,c,result)
implicit none
real(kind=..) a(*),b(*),c(*),result
integer i,n
result=0
do i=1,n
result=result+a(i)*b(i)*c(i)
enddo
end
dot3 is external, meaning not in a module/contains construct. kind should obviously match main declaration.
in main code:
innerprod=0
do iVar = 1, nVar
call dot3(ni*nj*nk, U1(1,1,1,iVar),U2(1,1,1,iVar),intW,result)
innerProd=innerProd+result
enddo
I had the same observation comparing Numpy and Fortran code.
The difference turns out to be the version of BLAS, I found using DGEMM from netlib is similar to looping and about three times slower than OpenBLAS (see profiles in this answer).
The most surprising thing for me was that OpenBLAS provides code which is so much faster than just compiling a Fortran triple nested loop. It seems this is the whole point of GotoBLAS, which was handwritten in assembly code for the processor architecture.
Even timing the right thing, ordering loops correctly, avoiding copies and using every optimising flag (in gfortran), the performance is still about three times slower than OpenBLAS. I've not tried ifort or pgi, but I wonder if this explains the upvoted comment by #kvantour "loop finishes in 0.6s for me" (note intrinsic matmul is replaced by BLAS in some implementations).
I am trying to write a program to solve for pipe diameter for a pump system I've designed. I've done this on paper and understand the mechanics of the equations. I would appreciate any guidance.
EDIT: I have updated the code with some suggestions from users, still seeing quick divergence. The guesses in there are way too high. If I figure this out I will update it to working.
MODULE Sec
CONTAINS
SUBROUTINE Secant(fx,xold,xnew,xolder)
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP), PARAMETER:: gamma=62.4
REAL(DP)::z,phead,hf,L,Q,mu,rho,rough,eff,pump,nu,ppow,fric,pres,xnew,xold,xolder,D
INTEGER::I,maxit
INTERFACE
omitted
END INTERFACE
Q=0.0353196
Pres=-3600.0
z=-10.0
L=50.0
mu=0.0000273
rho=1.940
nu=0.5
rough=0.000005
ppow=412.50
xold=1.0
xolder=0.90
D=11.0
phead = (pres/gamma)
pump = (nu*ppow)/(gamma*Q)
hf = phead + z + pump
maxit=10
I = 1
DO
xnew=xold-((fx(xold,L,Q,hf,rho,mu,rough)*(xold-xolder))/ &
(fx(xold,L,Q,hf,rho,mu,rough)-fx(xolder,L,Q,hf,rho,mu,rough)))
xolder = xold
xold = xnew
I=I+1
WRITE(*,*) "Diameter = ", xnew
IF (ABS(fx(xnew,L,Q,hf,rho,mu,rough)) <= 1.0d-10) THEN
EXIT
END IF
IF (I >= maxit) THEN
EXIT
END IF
END DO
RETURN
END SUBROUTINE Secant
END MODULE Sec
PROGRAM Pipes
USE Sec
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP)::xold,xolder,xnew
INTERFACE
omitted
END INTERFACE
CALL Secant(f,xold,xnew,xolder)
END PROGRAM Pipes
FUNCTION f(D,L,Q,hf,rho,mu,rough)
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP), PARAMETER::pi=3.14159265d0, g=9.81d0
REAL(DP), INTENT(IN)::L,Q,rough,rho,mu,hf,D
REAL(DP)::f, fric, reynold, coef
fric=(hf/((L/D)*(((4.0*Q)/(pi*D**2))/2*g)))
reynold=((rho*(4.0*Q/pi*D**2)*D)/mu)
coef=(rough/(3.7d0*D))
f=(1/SQRT(fric))+2.0d0*log10(coef+(2.51d0/(reynold*SQRT(fric))))
END FUNCTION
You very clearly declare the function in the interface (and the implementation) as
FUNCTION f(L,D,Q,hf,rho,mu,rough)
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP), PARAMETER::pi=3.14159265, g=9.81
REAL(DP), INTENT(IN)::L,Q,rough,rho,mu,hf,D
REAL(DP)::fx
END FUNCTION
So you need to pass 7 arguments to it. And none of them are optional.
But when you call it, you call it as
xnew=xold-fx(xold)*((xolder-xold)/(fx(xolder)-fx(xold))
supplying a single argument to it. When you try to compile it with gfortran for example, the compiler will complain for not getting any argument for D (the second dummy argument), because it stops with the first error.
It seems that the initial values for xold and xolder are too far from the solution. If we change them as
xold = 3.0d-5
xolder = 9.0d-5
and changing the threshold for convergence more tightly as
IF (ABS(fx(xnew,L,Q,hf,rho,mu,rough)) <= 1.0d-10) THEN
then we get
...
Diameter = 7.8306011049894322E-005
Diameter = 7.4533171406818087E-005
Diameter = 7.2580746283970710E-005
Diameter = 7.2653611474296094E-005
Diameter = 7.2652684750264582E-005
Diameter = 7.2652684291155581E-005
Here, we note that the function f(x) is defined as
FUNCTION f(D,L,Q,hf,rho,mu,rough)
...
f = (1/(hf/((L/D)*((4*Q)/pi*D)))) !! (1)
+ 2.0 * log( (rough/(3.7*D)) + (2.51/(((rho*((4*Q)/pi*D))/mu) !! (2)
* (hf/((L/D)*((4*Q)/pi*D))))) !! (3)
)
END FUNCTION
where terms in Lines (1) and (3) are both constant, while terms in Line (2) are some constants over D. So, we see that f(D) = c1 - 2.0 * log( D / c2 ), so we can obtain the solution analytically as D = c2 * exp(c1/2.0) = 7.26526809959e-5, which agrees well with the numerical solution above. To get a rough idea of where the solution is, it is useful to plot f(D) as a function of D, e.g. using Gnuplot.
But I am afraid that the expression for f(D) itself (given in the Fortran code) might include some typo due to many parentheses. To avoid such issues, it is always useful to first arrange the expression for f(D) as simplest as possible before making a program.
(One TIP is to extract constant factors outside and pre-calculate them.)
Also, for debugging purposes it is sometimes useful to check the consistency of physical dimensions and physical units of various terms. Indeed, if the magnitude of the obtained solution is too large or too small, there might be some problem of conversion factors for physical units, for example.
I am trying to write a program to solve for pipe diameter for a pump system I've designed. I've done this on paper and understand the mechanics of the equations. I would appreciate any guidance.
EDIT: I have updated the code with some suggestions from users, still seeing quick divergence. The guesses in there are way too high. If I figure this out I will update it to working.
MODULE Sec
CONTAINS
SUBROUTINE Secant(fx,xold,xnew,xolder)
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP), PARAMETER:: gamma=62.4
REAL(DP)::z,phead,hf,L,Q,mu,rho,rough,eff,pump,nu,ppow,fric,pres,xnew,xold,xolder,D
INTEGER::I,maxit
INTERFACE
omitted
END INTERFACE
Q=0.0353196
Pres=-3600.0
z=-10.0
L=50.0
mu=0.0000273
rho=1.940
nu=0.5
rough=0.000005
ppow=412.50
xold=1.0
xolder=0.90
D=11.0
phead = (pres/gamma)
pump = (nu*ppow)/(gamma*Q)
hf = phead + z + pump
maxit=10
I = 1
DO
xnew=xold-((fx(xold,L,Q,hf,rho,mu,rough)*(xold-xolder))/ &
(fx(xold,L,Q,hf,rho,mu,rough)-fx(xolder,L,Q,hf,rho,mu,rough)))
xolder = xold
xold = xnew
I=I+1
WRITE(*,*) "Diameter = ", xnew
IF (ABS(fx(xnew,L,Q,hf,rho,mu,rough)) <= 1.0d-10) THEN
EXIT
END IF
IF (I >= maxit) THEN
EXIT
END IF
END DO
RETURN
END SUBROUTINE Secant
END MODULE Sec
PROGRAM Pipes
USE Sec
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP)::xold,xolder,xnew
INTERFACE
omitted
END INTERFACE
CALL Secant(f,xold,xnew,xolder)
END PROGRAM Pipes
FUNCTION f(D,L,Q,hf,rho,mu,rough)
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP), PARAMETER::pi=3.14159265d0, g=9.81d0
REAL(DP), INTENT(IN)::L,Q,rough,rho,mu,hf,D
REAL(DP)::f, fric, reynold, coef
fric=(hf/((L/D)*(((4.0*Q)/(pi*D**2))/2*g)))
reynold=((rho*(4.0*Q/pi*D**2)*D)/mu)
coef=(rough/(3.7d0*D))
f=(1/SQRT(fric))+2.0d0*log10(coef+(2.51d0/(reynold*SQRT(fric))))
END FUNCTION
You very clearly declare the function in the interface (and the implementation) as
FUNCTION f(L,D,Q,hf,rho,mu,rough)
IMPLICIT NONE
INTEGER,PARAMETER::DP=selected_real_kind(15)
REAL(DP), PARAMETER::pi=3.14159265, g=9.81
REAL(DP), INTENT(IN)::L,Q,rough,rho,mu,hf,D
REAL(DP)::fx
END FUNCTION
So you need to pass 7 arguments to it. And none of them are optional.
But when you call it, you call it as
xnew=xold-fx(xold)*((xolder-xold)/(fx(xolder)-fx(xold))
supplying a single argument to it. When you try to compile it with gfortran for example, the compiler will complain for not getting any argument for D (the second dummy argument), because it stops with the first error.
It seems that the initial values for xold and xolder are too far from the solution. If we change them as
xold = 3.0d-5
xolder = 9.0d-5
and changing the threshold for convergence more tightly as
IF (ABS(fx(xnew,L,Q,hf,rho,mu,rough)) <= 1.0d-10) THEN
then we get
...
Diameter = 7.8306011049894322E-005
Diameter = 7.4533171406818087E-005
Diameter = 7.2580746283970710E-005
Diameter = 7.2653611474296094E-005
Diameter = 7.2652684750264582E-005
Diameter = 7.2652684291155581E-005
Here, we note that the function f(x) is defined as
FUNCTION f(D,L,Q,hf,rho,mu,rough)
...
f = (1/(hf/((L/D)*((4*Q)/pi*D)))) !! (1)
+ 2.0 * log( (rough/(3.7*D)) + (2.51/(((rho*((4*Q)/pi*D))/mu) !! (2)
* (hf/((L/D)*((4*Q)/pi*D))))) !! (3)
)
END FUNCTION
where terms in Lines (1) and (3) are both constant, while terms in Line (2) are some constants over D. So, we see that f(D) = c1 - 2.0 * log( D / c2 ), so we can obtain the solution analytically as D = c2 * exp(c1/2.0) = 7.26526809959e-5, which agrees well with the numerical solution above. To get a rough idea of where the solution is, it is useful to plot f(D) as a function of D, e.g. using Gnuplot.
But I am afraid that the expression for f(D) itself (given in the Fortran code) might include some typo due to many parentheses. To avoid such issues, it is always useful to first arrange the expression for f(D) as simplest as possible before making a program.
(One TIP is to extract constant factors outside and pre-calculate them.)
Also, for debugging purposes it is sometimes useful to check the consistency of physical dimensions and physical units of various terms. Indeed, if the magnitude of the obtained solution is too large or too small, there might be some problem of conversion factors for physical units, for example.