Efficient convergence check - Fortran

I have a grid with thousands of double precision reals.
It iterates over the grid, and I need it to stop when it has reached convergence to 3 decimal places.
The target is to have it run as fast as possible, but it needs to give the same result (to 3 dp) every time.
At the minute I'm doing something like this:
REAL(KIND=DP) :: TOL = 0.001_DP
DO WHILE (.NOT. CONVERGED)
   CONVERGED = .TRUE.
   DO I = 1, NUM_POINTS
      NEW_POTENTIAL = !blah blah blah
      IF (CONVERGED) THEN
         IF (NEW_POTENTIAL < OLD_POTENTIAL - TOL .OR. NEW_POTENTIAL > OLD_POTENTIAL + TOL) THEN
            CONVERGED = .FALSE.
         END IF
      END IF
      OLD_POTENTIAL = NEW_POTENTIAL
   END DO
END DO
I'm thinking that many IF statements can't be too great for performance. I thought about checking for convergence at the end instead: finding the average value (summing the whole grid and dividing by num_points) and checking whether that has converged in the same way as above, but I'm not convinced this will always be accurate.
What is the best way of doing this?

If I understand correctly you've got some kind of time-stepping going on, where you create the values in new_potential by calculations on old_potential. Then make old equal to new and carry on.
You could replace your existing convergence tests with the single statement
converged = all(abs(new_potential - old_potential)<tol)
which might be faster. If the speed of the test is a major concern you could test only every other (or every third or fourth ...) iteration.
A few comments:
1) If you used a potential array with 2 planes, instead of an old_ and new_potential, you could transfer new_ into old_ by swapping indices at the end of each iteration. As your code stands there's a lot of data movement going on.
2) While semantically you are right to have a while loop, I'd always use a do loop with a maximum number of iterations, just in case the convergence criterion is never met.
3) In your declaration REAL(KIND=DP) :: TOL = 0.001_DP the _DP suffix on the literal makes no practical difference for a tolerance of this size; REAL(KIND=DP) :: TOL = 0.001 is adequate here (the default-real literal is converted to DP on assignment, and the tiny conversion error is irrelevant for a 3 dp test). I'd also make TOL a parameter; the compiler may be able to optimise its use if it knows that it is immutable.
4) You don't really need to execute CONVERGED = .TRUE. inside the outermost loop, set it before the first iteration -- this will save you a nanosecond or two.
Finally, if your convergence criterion is that every element in the potential array has converged to 3dp then that is what you should test for. It would be relatively easy to construct counterexamples for your suggested averages. However, my concern would be that your system will never converge on every element and that you should be using some matrix norm computation to determine convergence. SO is not the place for a lesson in that topic.
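Putting those suggestions together, a rough sketch might look like the following (max_iter and the update itself are placeholders, not code from the question, and num_points is assumed to be a parameter):
integer, parameter :: max_iter = 10000            ! placeholder upper bound
real(kind=dp), parameter :: tol = 0.001_dp
real(kind=dp) :: potential(num_points, 2)         ! two planes: old and new
integer :: iter, old, new

old = 1; new = 2
converged = .false.
do iter = 1, max_iter
   ! ... compute potential(:, new) from potential(:, old) ...
   converged = all(abs(potential(:, new) - potential(:, old)) < tol)
   if (converged) exit
   old = 3 - old             ! swap the planes by swapping indices,
   new = 3 - new             ! so no data is copied
end do
if (.not. converged) write (*, *) 'not converged after', max_iter, 'iterations'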

What are the calculations for the convergence criterion? Unless they are more expensive than the calculations to advance the potential, it is probably better to keep the IF statement and terminate the loop as soon as possible, rather than guess a very large number of iterations to be sure of obtaining a good solution.
Re High Performance Mark's suggestion #1, if the copying operation is a significant portion of the run time, you could also use pointers.
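For example, something along these lines (just a sketch, reusing the question's array names):
real(kind=dp), pointer :: old_potential(:), new_potential(:), tmp(:)

allocate(old_potential(num_points), new_potential(num_points))
! ... inside the iteration, after new_potential has been filled:
tmp           => old_potential
old_potential => new_potential
new_potential => tmp       ! next iteration overwrites the stale data in place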
The only way to be sure about this stuff is to measure the run time ... Fortran provides intrinsic functions to measure both CPU and clock time. Otherwise you may modify some portion of your code to make it faster, perhaps making it harder to understand and possibly introducing a bug, possibly without much improvement in runtime ... if that portion was taking only a small fraction of the total runtime, no amount of cleverness can make much difference.
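A minimal timing harness using those intrinsics (cpu_time and system_clock are standard; the section being measured is whatever you suspect is hot):
real :: t0, t1
integer :: c0, c1, rate

call cpu_time(t0)
call system_clock(c0, rate)
! ... code to be measured ...
call system_clock(c1)
call cpu_time(t1)
print *, 'CPU time :', t1 - t0, 's'
print *, 'Wall time:', real(c1 - c0)/real(rate), 's'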
As High Performance Mark says, though the current semantics are elegant, you probably want to guard against an infinite loop. One approach:
PotentialLoop: do i = 1, MaxIter
   blah
   Converged = test...
   if (Converged) exit PotentialLoop
   blah
end do PotentialLoop
if (.NOT. Converged) write (*, *) "error, did not converge"

I = 1
DO
   NEWPOT = !bla bla bla
   IF (ABS(NEWPOT-OLDPOT).LT.TOL) EXIT
   OLDPOT = NEWPOT
   I = MOD(I,NUMPOINTS) + 1
END DO
Maybe better:
I = 1
DO
   NEWPOT = !bla bla bla
   IF (ABS(NEWPOT-OLDPOT).LT.TOL) EXIT
   OLDPOT = NEWPOT
   IF (I.EQ.NUMPOINTS) THEN
      I = 1
   ELSE
      I = I + 1
   END IF
END DO


How to determine the number of time steps in Modern Fortran

I need to calculate the number of steps for a numerical method in such a way that for i=0 the time is t=tstart and for i=nsteps the time is t=tstop.
I know the real numbers tstart, tstop and dt, and to calculate the integer nsteps I use
nsteps = FLOOR((tstop-tstart)/dt)
but I'm worried, because FLOOR could give an integer one less than the one I need. The following is the loop over time:
DO i = 0, nsteps
   t = tstart + i*dt
END DO
I think this is a very common calculation, but I do not know the best way to do it. Maybe there is a better idea than a DO loop, maybe a DO WHILE loop.
Thanks in advance for your comments.
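To make the worry concrete, here is a small standalone illustration (not the asker's code): for some combinations of tstart, tstop and dt the quotient lands just below the intended integer, so FLOOR drops a step, while NINT recovers it when dt is meant to divide the interval exactly.
program steps_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp) :: tstart = 0.0_dp, tstop = 0.7_dp, dt = 0.1_dp
   ! (tstop - tstart)/dt evaluates to 6.999999999999999... in double
   ! precision, so FLOOR gives 6 where 7 was intended; NINT gives 7.
   print *, floor((tstop - tstart)/dt), nint((tstop - tstart)/dt)
end program steps_demo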

Parallel simulation gives different results after some time steps when compared with serial and additional parallel runs

I am trying to run a code on vortex simulations in parallel using OpenMP. These are similar to particle simulations where at each time step, the position of a vortex at the next time step has to be computed from its velocity which is determined by the positions of all the other vortices at the current time step. The vortices are deleted once they leave the domain. I compare the number of vortices at each time step for the parallel version of code with the serial version of code, and run each version multiple times.
For the serial versions, vortex counts match exactly at every time step. For the parallel case, all the runs match the serial case for a few tens of time steps, after which each parallel run shows a difference but remains within a 7-10% error bound of the serial case (as can be seen in the result link below). I know that this may be because of round-off errors in the parallel case, arising from the difference in the order of computational steps due to distribution among the different threads, but should the error really be as high as 10%?
I have only used the reduction clause in a parallel do construct. The only parallel region in the whole code is within a function vblob(), which is inside a module that I call from a main code. The functions called within vblob(), ixi() and fxi(), are outside this module.
function vblob(blobs,xj,gj)
   complex(8), intent(in) :: blobs(:,:), xj
   complex(8) :: delxi, delxic, di, gvic, xi
   real(8), intent(in) :: gj
   real(8) :: vblob(2)
   integer :: p

   gvic = 0.0; delxi = 0.0; delxic = 0.0; di = 0.0; xi = 0.0
   !$omp parallel do private(xi,delxic,delxi,di) shared(xj) reduction(+:gvic)
   do p = 1, size(blobs,1)
      xi = ixi(blobs(p,1))
      delxic = xj-conjg(xi)
      delxi = xj-xi
      di = del*fxi(xi)
      gvic = gvic + real(blobs(p,2))*1/delxic
      if (abs(delxi) .gt. 1.E-4) then
         gvic = gvic + (-1)*real(blobs(p,2))*1/delxi
      end if
   end do
   !$omp end parallel do
   gvic = j*gvic*fxi(xj)/(2*pi)
   vblob(1) = real(gvic)
   vblob(2) = -imag(gvic)
end function vblob
If the way I have constructed the parallel code is wrong, then errors should show up within the first few time steps, right?
(As can be seen in this result, the 'blobs' and 'sheets' are just types of vortex elements; the blue line is the total number of elements. P and S stand for parallel and serial respectively, and R stands for run. The solid plot markers are the serial code and the hollow ones are the three runs of the parallel code.)
EDIT: When I change the numerical precision of my variables to real(4) instead, the divergence in results happens at an earlier time step than in the real(8) case above. So it's clearly a round-off error issue.
TL;DR: I want to check with anyone else who has seen such a result over a range of time steps, where the parallel code matches the serial one for the first few time steps and then diverges.
Your code essentially sums up a lot of terms in gvic. Floating-point arithmetic is not associative, that is, (a+b)+c is not the same as a+(b+c) due to rounding. Also, depending on the values and the signs on the terms, there may be a serious loss of precision in each operation. See here for a really mandatory read on the subject.
While the sequential loop computes (given no clever compiler optimisations):
gvic = (...((((g_1 + g_2) + g_3) + g_4) + g_5) + ...)
where g_i is the value added to gvic by iteration i, the parallel version computes:
gvic = t_0 + t_1 + t_2 + ... t_(#threads-1)
where t_i is the accumulated private value of gvic in thread i (threads in OpenMP are 0-numbered even in Fortran). The order in which the different t_is are reduced is unspecified. The OpenMP implementation is free to choose whatever it deems fine. Even if all t_is are summed in order, the result will still differ from the one computed by the sequential loop. Unstable numerical algorithms are exceptionally prone to producing different results when parallelised.
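A tiny standalone demonstration of that non-associativity (unrelated to the vortex code, just double-precision arithmetic):
program not_associative
   implicit none
   real(8) :: a = 1.0d0, b = 1.0d20, c = -1.0d20
   ! a is far below one ulp of b, so (a + b) absorbs a completely and the
   ! two groupings differ: the first prints 0.0, the second prints 1.0.
   ! A parallel reduction effectively regroups the sum in just this way.
   print *, (a + b) + c
   print *, a + (b + c)
end program not_associative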
This is something you can hardly avoid completely, but instead learn to control or simply live with its consequences. In many cases, the numerical solution to a problem is an approximation anyway. You should focus on conserved or statistical properties. For example, an ergodic molecular dynamics simulation may produce a completely different phase trajectory in parallel, but values such as the total energy or the thermodynamic averages will be pretty close (unless there is some serious algorithmic error or really bad numerical instability).
A side note - you are actually lucky to enter this field now, when most CPUs use standard 32- and 64-bit floating-point arithmetic. Years ago, when x87 was a thing, floating-point operations were done with 80-bit internal precision and the end result would depend on how many times a value leaves and re-enters the FPU registers.

Repeating a step in a Fortran loop

I'm trying to write a Fortran 90 program to carry out Euler's method to solve ode's using an adaptive time step.
I have an if statement inside of a do while loop, in which I check that the error at each iteration of the code is less than a certain tolerance. However, if it is not less than certain tolerance, I must change a certain value (the step size) and carry out the calculation again to get a new error to compare with the tolerance.
It looks something like this (and forgive me, this is my first time using this website):
do while (some condition)
   (Get an approximation to the ODE with various subroutine calls)
   (Calculate the error)
   if (error < tol) then
      step = step/2
   else
      step = 2*step
      (Something that will return to the top of my do while loop)
   end if
end do
Say, for example, I had do while (i < 4), where i starts at 1, and my error was not less than my tolerance; then I would have to repeat the calculation for i=1 with a new step size.
I hope this makes sense to those of you who read this. If you need any clarification, let me know.
Because there is no explicit counter in the do while loop (unlike in a normal do i=1,... loop), you can just use cycle to start a new iteration. It will be the same as repeating the current iteration, but the condition will be evaluated again. If it shouldn't be evaluated, you would have to use go to or restructure your code.
Or another loop nested inside the main loop might be better, but that probably counts as the restructuring mentioned above. It depends on what the condition is, how you change step and i, and how the tolerance depends on the step and i.
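As an illustration of the nested-loop version, here is a sketch with placeholder names (euler_step, error_estimate and the state variables are not the asker's routines), using the usual convention of shrinking the step on failure and growing it after an accepted step:
do while (t < t_end)
   do                                       ! retry loop for a single time step
      call euler_step(y, t, step, y_trial)  ! placeholder for the subroutine calls
      error = error_estimate(y, y_trial)    ! placeholder error computation
      if (error < tol) exit                 ! step accepted, leave the retry loop
      step = step/2                         ! too inaccurate: halve and redo the step
   end do
   y = y_trial
   t = t + step
   step = 2*step                            ! try a larger step next time
end do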

Precision problems with very large reals - Fortran

The problem I'm attempting to tackle at the moment involves computing the order of 10 modulo n, where n could be any number less than 1000. I have a function to do exactly that; however, I am unable to obtain accurate results as the value of the order increases.
The function works correctly as long as the order is sufficiently small, but returns incorrect values for large orders. So I stuck in some output to the terminal to locate the problem, and discovered that when I use exponentiation, the accuracy of my reals is being compromised.
I declared ALL variables in the function and in the program I tested it from as real(kind=nkind), where nkind = selected_real_kind(p=18, r=308). Any numbers explicitly referenced are also declared as, for example, 1.0_nkind. However, when I print out 10**n for n counting up from 1, I find that up to 10**27 the values are correct, but 10**28 gives 9999999999999999999731564544. All higher powers are similarly distorted, and this inaccuracy is the source of my problem.
So, my question is, is there a way to work around the error? I don't know of any way to use a more extended precision than I'm already using in the calculations.
Thanks,
Sean
*EDIT: There's not much to see in the code, but here you go:
integer, parameter :: nkind = selected_real_kind(p=18, r=308)

real(kind=nkind) function order_ten_modulo(n)
   real(kind=nkind) :: n, power
   power = 1.0_nkind
   if (mod(n, 5.0_nkind) == 0 .or. mod(n, 2.0_nkind) == 0) then
      order_ten_modulo = 0
      return
   end if
   do
      if (power > 300.0) then ! Just picked this number as a safeguard against endless looping
         exit
      end if
      if (mod(10.0_nkind**power, n) == 1.0_nkind) then
         order_ten_modulo = power
         exit
      end if
      power = power + 1.0_nkind
   end do
   return
end function order_ten_modulo
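For what it's worth, the precision problem disappears entirely if the running power is kept reduced modulo n with integer arithmetic. A sketch (not the original function) could look like this, since r*10 never exceeds 10*n < 10000 for n < 1000:
integer function order_ten_mod(n)
   integer, intent(in) :: n
   integer :: r, k
   order_ten_mod = 0
   if (mod(n, 2) == 0 .or. mod(n, 5) == 0) return   ! 10 has no order mod n
   r = 1
   do k = 1, n                  ! the order always divides phi(n), which is < n
      r = mod(r*10, n)          ! stays below n at all times
      if (r == 1) then
         order_ten_mod = k
         return
      end if
   end do
end function order_ten_mod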

Fast round up/down for doubles in Fortran?

Is there a fast way to round up/down in Fortran?
Because the bit representation of positive double-precision numbers is ordered monotonically, it's possible to implement rounding as below.
pinf and ninf are global constants which are +/- infinity respectively:
function roundup(x)
   double precision, intent(in) :: x
   double precision :: roundup
   if (isnan(x)) then
      roundup = pinf
      return
   end if
   if (x == pinf) then
      roundup = pinf
      return
   end if
   if (x == ninf) then
      roundup = ninf
      return
   end if
   if (x > 0) then
      roundup = transfer((transfer(x,1_8)+1_8),1d0)
   else if (x < 0) then
      roundup = transfer((transfer(x,1_8)-1_8),1d0)
   else
      if (transfer(x,1_8) == Z'0000000000000000') then
         roundup = transfer((transfer(x,1_8)+1_8),1d0)
      else
         roundup = transfer((transfer(-x,1_8)+1_8),1d0)
      end if
   end if
end function roundup
I feel this is not the best way to do it: it's slow, even though it uses almost nothing but bit operations.
Another way is to use multiplication and an epsilon:
eps = epsilon(1d0)

function roundup2(x)
   double precision, intent(in) :: x
   double precision :: roundup2
   if (isnan(x)) then
      roundup2 = pinf
      return
   else if (x >= eps) then
      roundup2 = x*(1d0+eps)
   else if (x <= -eps) then
      roundup2 = x*(1d0-eps)
   else
      roundup2 = eps
   end if
end function roundup2
For some x both functions return the same result (1d0, 158d0); for some they don't (0.1d0, 15d0).
The first function is more accurate, but it's about 3.6 times slower than the second
(11.1 vs 3.0 seconds on a test of 10^9 rounds):
print *, x, y, abs(x-y)
do i = 1, 1000000000
   x = roundup(x)
   !y = roundup2(y)
end do
print *, x, y, abs(x-y)
With no checks for NaN/infinities the first function's test takes 8.5 seconds (-20%).
I use the rounding function really heavily and it takes a lot of time in the program's profile. Is there a cross-platform way to round faster with no loss of precision?
Update
The question assumes that roundup and rounddown are called at the same time, with no ability to reorder them. I didn't mention rounddown to keep the topic short.
Hint:
The first function uses two transfer calls and one addition, yet it's slower than the one multiplication and one addition in the second case. Why does transfer cost so much when it doesn't do anything with the number's bits? Is it possible to replace transfer with faster function(s), or to avoid the addition calls altogether?
I would recommend that you look at the Fortran standard IEEE floating point intrinsic modules (IEEE_ARITHMETIC, IEEE_FEATURES, IEEE_EXCEPTIONS). These provide IEEE_SET_ROUNDING_MODE where you can set the rounding mode for subsequent operations. Ideally you'd use IEEE_GET_ROUNDING_MODE to get the current mode and save it, set the new one, do your operations, then restore the mode.
Some caveats - changing the processor rounding mode is itself a slow operation, but if you do it once and then do lots of rounds, that will be a win. Not all current Fortran compilers support the IEEE intrinsic modules, but most reasonable ones should. You might need to tell the compiler you are playing with the IEEE environment - for Intel Fortran, use "-fp-model strict".
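A sketch of the save/set/restore pattern (assumes a compiler that supports the F2003 IEEE modules and strict floating-point semantics, e.g. -fp-model strict, so the two divisions are not folded away):
program ieee_round_demo
   use, intrinsic :: ieee_arithmetic
   implicit none
   type(ieee_round_type) :: saved_mode
   double precision :: a, b, up, down
   a = 1.0d0
   b = 3.0d0
   call ieee_get_rounding_mode(saved_mode)   ! remember the current mode
   call ieee_set_rounding_mode(ieee_up)
   up = a/b                                  ! inexact result rounds toward +Inf
   call ieee_set_rounding_mode(ieee_down)
   down = a/b                                ! same operation rounds toward -Inf
   call ieee_set_rounding_mode(saved_mode)   ! restore the original mode
   print *, up - down                        ! one ulp apart
end program ieee_round_demo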
If I'm understanding correctly what you want to do, doesn't the "nearest" intrinsic do what you want, if you feed it +/- infinity as the second (direction) argument?
http://gcc.gnu.org/onlinedocs/gfortran/NEAREST.html#NEAREST
This might work, if the compiler implements this with decent performance. If you want NaN to round to Inf, you'll have to add that in a wrapper.
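A sketch of such a wrapper (the NaN convention and the pinf constant follow the question; the infinity cases handled explicitly in the original roundup() would still need their own checks):
function roundup3(x) result(r)
   double precision, intent(in) :: x
   double precision :: r
   if (isnan(x)) then
      r = pinf                   ! keep the question's convention: NaN -> +Inf
   else
      r = nearest(x, 1.0d0)      ! next representable double toward +Infinity;
   end if                        ! only the sign of the second argument matters
end function roundup3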
As for why roundup2 is faster, I can't tell for certain what's going on on your machine, but I can say two things:
The addition in roundup2 is probably optimized out (if eps is a parameter?), so there's really just a multiplication.
If the transfer really does anything at all, that could easily slow the function down noticeably, since the function itself is so short. That might even be true if the transfer is just making superfluous copies of x.