Fortran Openmp: Ghost threads interrupt application upon allocate

Fortran Openmp: Ghost threads interrupt application upon allocate - fortran

I am experiencing an issue which starts to drive me crazy. As part of a large software tool I am writing a Fortran module to perform astrodynamic computations.
The main functionality is contained in a subroutine which uses OpenMP and other subroutines to perform its tasks.
The workflow is as follows:
1) Preparations - read in files
2) Start the first parallel omp parallel do
3) Based on the working mode, the function might call itself recursively and process a separate part of the data (thereby executing a second omp do loop)
4) Combine the results into one large array of a derived type
Up to this point everything has been running as expected. Next is the part that fails:
5) Using the results from 4, start another parallel loop to continue the processing.
Step 5 sometimes works and sometimes it stops with an access violation. This is independently of the optimization level selected (nominally we use O2, but it happens also with O0).
Hence I fired up Intel inspector and looked for data races/deadlocs,....
It reported 5 very strange data races, which make no sense to me, as the code locations are either at the definition/end of a subroutine, a codeline reading threadprivate global variable, an Intel MKL routine or at a location reading a local allocatable array which is in a subroutine that is called within the parallel region.
After reading upon things a bit, I enabled the recursive switch to force the local array to end on the stack. Another Inspector analysis was interrupted by the same access violation. When loading the results no errors were detected, but the tool warns that due to the abnormal end data may have been lost.
I then tried to put the entire loop into an OMP critical - just as a test. Now the access violation disappeared, but the code is stuck very early.After processing 15/500 objects it just stops. This looked like a deadlock.
So I continued the search and commented out every OMP statement in the do loop of step 5.
Interestingly the code stops as well at iteration 15!
Hence I used the debugger to locate the call where the infinite waiting state occurs.
It is at the final allocate statement:
Subroutine doWork(t0, Pert, allAtmosData, considerArray, perigeeFirstTimeStepUncertainties)
real(dp), intent(in) :: t0
type (tPert), intent(in) :: Pert
type (tThermosphereData), dimension(:), allocatable, intent(in) :: allAtmosData
integer(i4), dimension(3), intent(in) :: considerArray
real(dp), dimension(:,:), allocatable, intent(out) :: perigeeFirstTimeStepUncertainties
!Locals:
real(dp), dimension(:), allocatable :: solarFluxPerigeeUnc, magneticIndexUnc, modelUnc
integer(i4) :: dataPoints
!Allocate memory for the computations
dataPoints = size(allAtmosData, 1)
allocate(solarFluxPerigeeUnc(0:dataPoints-1), source=0.0_dp)
allocate(magneticIndexUnc(0:dataPoints-1), source=0.0_dp)
allocate(modelUnc(0:dataPoints-1), source=0.0_dp)
... Subroutine body ...
!Assign outputs
allocate(perigeeFirstTimeStepUncertainties(3, 0:dataPoints-1), source=0.0_dp)
perigeeFirstTimeStepUncertainties(MGN_SOLAR_FLUX_UNCERTAINTY,:) = solarFluxPerigeeUnc
perigeeFirstTimeStepUncertainties(MGN_MAG_INDEX_UNCERTAINTY,:) = magneticIndexUnc
perigeeFirstTimeStepUncertainties(MGN_MODEL_UNCERTAINTY,:) = modelUnc
End Subroutine
The subroutine works perfectly in other places of the overall program and also for the first
14 iterations of step 5. In the 15th iteration however it is stuck at the last allocate, or as I just have managed to produce - also results in an access violation.
When I pause the code using the debugger, I see that somehow the openmp library has been loaded and the code hangs/crashes in frontend.cpp, which I have no access to:
What is happening here? How can a simple allocate suddenly cause and OMP activity in a loop where any OMP statement has been commented out?
As a sidenote: If I comment out the allocate and pass the size as an extra argument, the same error happens at the next allocate in the code.
Any help or workaround is highly appreciated! I already googled on how to shut down any prior thread-pool, so that there is definately no omp ghost thread activity left, but apparently there is no way to do so. Also ultimately the loop shall work in parallel.
EDIT: I just did a second test on the machine with 2018 v5. It also shows that the allocate statement results in a call to for_alloc_allocatable() and then the OMP statements. Also the mov statement where the program crashes is shown:
I guess that something is probably causing memory corruption in the second recursive OMP loop, as somehow the allocate is redirected to a threadpool memory allocation, however all threads but the main one should be passive? I cannot read assembler, but I was surprised to find that TBB is used by OpenMp? Could this be a bug in TBB?

Related

Poor scaling and a segmentation fault in a Fortran OpenMP code

I'm having some trouble when executing a program with a parallel do. Here is a test code.
module test
use, intrinsic :: iso_fortran_env, only: dp => real64
implicit none
contains
subroutine Addition(x,y,s)
real(dp),intent(in) :: x,y
real(dp), intent(out) :: s
s = x+y
end subroutine Addition
function linspace(length,xi,xf) result (vec)
! function to create an equally spaced vector given a begin and end point
real(dp),intent(in) :: xi,xf
integer, intent(in) :: length
real(dp),dimension(1:length) :: vec
integer ::i
real(dp) :: increment
increment = (xf-xi)/(real(length)-1)
vec(1) = xi
do i = 2,length
vec(i) = vec(i-1) + increment
end do
end function linspace
end module test
program paralleltest
use, intrinsic :: iso_fortran_env, only: dp => real64
use test
use :: omp_lib
implicit none
integer, parameter :: length = 1000
real(dp),dimension(length) :: x,y
real(dp) :: s
integer:: i,j
integer :: num_threads = 8
real(dp),dimension(length,length) :: SMatrix
x = linspace(length,.0d0,1.0d0)
y = linspace(length,2.0d0,3.0d0)
!$ call omp_set_num_threads(num_threads)
!$OMP PARALLEL DO
do i=1,size(x)
do j = 1,size(y)
call Addition(x(i),y(j),s)
SMatrix(i,j) = s
end do
end do
!$OMP END PARALLEL DO
open(unit=1,file ='Add6.dat')
do i= 1,size(x)
do j= 1,size(y)
write(1,*) x(i),";",y(j),";",SMatrix(i,j)
end do
end do
close(unit=1)
end program paralleltest
I'm running the program in the following waygfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium and then export OMP_NUM_THREADS=8
This simple code brings me at least two big questions on parallel do. The first is that if I run with length = 1100 or greater, I have Segmentation fault (core dump) error message but with smaller values it runs with no problem. The second is about the time it takes. When I run it with length = 1000 (run with time ./pt.out) the time it takes is 1,732s but if I run it in a sequential way (without calling the -fopenmplibrary and with taskset -c 4 time./pt.out ) it takes 1,714s. I guess the difference between both ways arise in a longer and more complex code where parallel is more usefull. In fact when I tried it with more complex calculations running in parallel with eight threads, time was reduced at half that it took in sequential but not an eighth as I expected. In view of this my questions are, is any optimization available always or is it code dependent? and second, is there a friendly way to control which thread runs which iteration? That is the first running the first length/8 iteration, and so on, like performing several taskset 's with different code where in each is the iteration that I want.

As I commented, the Segmentation fault has been treated elsewhere Why Segmentation fault is happening in this openmp code?, I would use an allocatable array, but you can also set the stacksize using ulimit -s.
Regarding the time, almost all of the runtime is spent in writing the array to the external file.
But even if you remove that and you measure the time only spent in the parallel section using omp_get_wtime() and increase the problem size, it still does not scale too well. This because there is very little computation for the CPU to do and a lot of array writing to memory (accessing main memory is slow - cache misses).
As Jean-Claude Arbaut pointed out, your loop order is wrong and makes accessing the memory even slower. Some compilers can change that for you with higher optimization levels (-O2 or -O3), but only some of them.
And even worse, as Jim Cownie pointed out, you have a race condition. Multiple threads try to use the same s for both reading and writing and the program is invalid. You need to make s private using private(s).
With the above fixes I get a roughly two times faster parallel section with four cores and four threads. Don't try to use hyper-threading, it slows the program down.
If you give the CPU more computational work to do, like s = Bessel_J0(x)/Bessel_J1(y) it scales pretty well for me, almost four times faster with four threads, and hyper threading does speed it up a little bit.
Finally, I suggest just removing the manual setting of the number of threads, it is a pain for testing. If you remove that, you can use OMP_NUM_THREADS=4 ./a.out easily.

Program is not any faster with OpenMP

My goal is to parallelize a section in my Fortran program. The flow of the program is:
Read data from a file
make some computations
write the results to 2 different files
Here I want to parallelize the writing process since I’m writing into different files.
module foo
use omp_lib
implicit none
type element
integer, dimension(:), allocatable :: v1, v2
real(kind=8), dimension(:,:), allocatbale :: M
end type element
contains
subroutine test()
implicit none
type(element) :: e
do
e = read_data_from_file()
call compute_data(e)
!$OMP SECTIONS
!$OMP SECTION
!$ call write_to_file1(e)
!$OMP SECTION
!$ call write_to_file2(e)
!$OMP END SECTIONS
end do
end subroutine test
...
end module foo
But this program isn't going anything faster. So I think that I’m missing something?

In general one can divide scientific computing codes in bandwidth bound and computational bound algorithms. The bandwidth bound algorithms are all that only do few operations on the data they need. Like having O(n) data where O(n) flops are performed on. Thinking of the hard disk speed or the network connection speed, I/O is a bandwidth bound operation as well and therefore not or only badly parallelizable.
If you really want to gain performance out of the parallelization split the code into bandwidth bound and computational bound algorithms and use your time to parallelize the later ones.

If you specify you problem more precisely there are hundreds of experts eager to solve it. From the comment to the answer above I see that you are using binary output but still has bandwidth left to write faster, that means that you disk speed is fine and you're not limited by parsing, but rather that you actual program is not putting out data in a faster pace than this.
So optimize your code, to make it catch up with your write-speed, instead of increasing the write speed with an equally slow code.
Writing them 2 files sequentially at the max of your bandwidth is as fast and much easier than writing in parallel (at the same max speed).
If I am mistaken, and you are indeed limited by IO, maybe this other question/answer can help you: How to avoid programs in status D.

How can a Fortran-OpenACC routine call another Fortran-OpenACC routine?

I'm currently attempting to accelerate a spectral element fluids solver by porting most of the routines to a GPGPU using OpenACC with the PGI (15.10) compiler. The source code is written in OO-Fortran. This software has "layers" of subroutines that call other functions and subroutines. To bring the code over to a GPU using openacc, I've been first attempting to place "$acc routine" directives in each routine that needs to be ported. During compilation, using "pgf90 -acc -Minfo=accel", I receive the following error :
nvvmCompileProgram error: 9.
Error: /tmp/pgacc2lMnIf9lMqx8.gpu (146, 24): parse invalid forward reference to function 'innerroutine_' with wrong type!
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (Test.f90: 1)
This same problem can be reproduced with the following simple fortran program :
PROGRAM Test
IMPLICIT NONE
CONTAINS
SUBROUTINE OuterRoutine( sol, xF, N )
!$acc routine
IMPLICIT NONE
INTEGER :: N
REAL(KIND=8) :: sol(0:N,1:3)
REAL(KIND=8) :: xF(0:N,1:3)
! LOCAL
INTEGER :: i
DO i = 0, N
xF(i,1:3) = InnerRoutine( sol(i,1:3) )
ENDDO
END SUBROUTINE OuterRoutine
FUNCTION InnerRoutine( sol ) RESULT( xF )
!$acc routine
IMPLICIT NONE
REAL(KIND=8) :: sol(1:3)
REAL(KIND=8) :: xF(1:3)
xF(1) = sol(1)*sol(2)
xF(2) = sol(1)*sol(3)
xF(3) = sol(1)*sol(1)
END FUNCTION InnerRoutine
END PROGRAM Test
Again, compiling the above program with "pgf90 -acc -Minfo=accel" yields the problem.
Does openacc support acc-enabled routines calling other acc-enabled routines ?
If so, what am I doing wrong ?

You're using the OpenACC "routine" directive correctly. The problem here is that we (PGI) don't yet support using "routine" with array-valued functions. The problem being that this support requires the compiler to create a temp array to hold the return value. Meaning that every thread would need to allocate this temp array causing a severe performance penalty. Worse is how to handle sharing the temp array if is a gang or worker level routine.
We do have open requests for this feature, but it may be awhile before we can address it. In the meantime, can you try inlining the routine? i.e. compile with "-Minline".

OpenMP race condition (Fortran 77 w/ COMMON block)

I am trying to parallelise some legacy Fortran code with OpenMP.
Checking for race conditions with Intel Inspector, I have come across a problem in the following code (simplified, tested example):
PROGRAM TEST
!$ use omp_lib
implicit none
DOUBLE PRECISION :: x,y,z
COMMON /firstcomm/ x,y,z
!$OMP THREADPRIVATE(/firstcomm/)
INTEGER :: i
!$ call omp_set_num_threads(3)
!$OMP PARALLEL DO
!$OMP+ COPYIN(/firstcomm/)
!$OMP+ PRIVATE(i)
do i=1,3000
z = 3.D0
y = z+log10(z)
x=y+z
enddo
!$OMP END PARALLEL DO
END PROGRAM TEST
Intel Inspector detects a race condition between the following lines:
!$OMP PARALLEL DO (read)
z = 3.D0 (write)
The Inspector "Disassembly" view offers the following about the two lines, respectively (I do not understand much about these, apart from the fact that the memory addresses in both lines seem to be different):
0x3286 callq 0x2a30 <memcpy>
0x3338 movq %r14, 0x10(%r12)
As in my main application, the problem occurs for one (/some) variable in the common block, but not for others that are treated in what appears to be the same way.
Can anyone spot my mistake, or is this race condition a false positive?
I am aware that the use of COMMON blocks, in general, is discouraged, but I am not able to change this for the current project.

Technically speaking, your example code is incorrect since you are using COPYIN to initialise threadprivate copies with data from uninitialised COMMON BLOCK. But that is not the reason for the data race - adding a DATA statement or simply assigning to x, y, and z before the parallel region does not change the outcome.
This is either a (very old) bug in Intel Fortran Compiler, or Intel is interpreting strangely the text of the OpenMP standard (section 2.15.4.1 of the current version):
The copy is done, as if by assignment, after the team is formed and prior to the start of execution of the associated structured block.
Intel implements the emphasised text by inserting a memcpy at the beginning of the outlined procedure. In other words:
!$OMP PARALLEL DO COPYIN(/firstcomm/)
do i = 1, 3000
...
end do
!$OMP END PARALLEL DO
becomes (in a mixture of Fortran and pseudo-code):
par_region0:
my_firstcomm = get_threadprivate_copy(/firstcomm/)
if (my_firstcomm != firstcomm) then
memcpy(my_firstcomm, firstcomm, size of firstcomm)
end if
// Actual implementation of the DO worksharing construct
call determine_iterations(1, 3000, low_it, high_it)
do i = low_it, high_it
...
... my_firstcomm used here instead of firstcomm
...
end do
call openmp_barrier
end par_region0
MAIN:
// Prepare a parallel region with 3 threads
// and fire the outlined code in the worker threads
call start_parallel_region(3, par_region0)
// Fire the outlined code in the master thread
call par_region0
call end_parallel_region
The outlined procedure first finds the address of the threadprivate copy of the common block, then compares that address to the address of the common block itself. If both addresses match, then the code is being executed in the master thread and no copy is needed, otherwise memcpy is called to make a bitwise copy of the master's data into the threadprivate block.
Now, one would expect that there should be a barrier at the end of the initialisation part and right before the start of the loop, and although Intel employees claim that there is one, there is none (tested with ifort 11.0, 14.0, and 16.0). Even more, the Intel Fortran Compiler does not honour the list of variables in the COPYIN clause and copies the entire common block if any variable contained in it is listed in the clause, i.e. COPYIN(x) is treated the same as COPYIN(/firstcomm/).
Whether those are bugs or features of Intel Fortran Compiler, only Intel could tell. It could also be that I'm misreading the assembly output. If anyone could find the missing barrier, please let me know. One possible workaround would be to split the combined directive and insert an explicit barrier before the worksharing construct:
!$OMP PARALLEL COPYIN(/firstcomm/) PRIVATE(I)
!$OMP BARRIER
!$OMP DO
do i = 1, 3000
z = 3.D0
y = z+log10(z)
x = y+z
end do
!$OMP END DO
!$OMP END PARALLEL
With that change, the data race will shift into the initialisation of the internal dispatch table within the log10 call, which is probably a false positive.
GCC implements COPYIN differently. It creates a shared copy of the threadprivate data of the master thread, which copy it then passes on to the worker threads for use in the copy process.

Stack overflow in Fortran 90

I have written a fairly large program in Fortran 90. It has been working beautifully for quite a while, but today I tried to step it up a notch and increase the problem size (it is a research non-standard FE-solver, if that helps anyone...) Now I get the "stack overflow" error message and naturally the program terminates without giving me anything useful to work with.
The program starts with setting up all relevant arrays and matrices, and after that is done it prints a few lines of stats regarding this to a log-file. Even with my new, larger problem, this works fine (albeit a little slow), but then it fails as the "number crunching" gets going.
What confuses me is that everything at that point is already allocated (and that worked without errors). I'm not entirely sure what the stack is (Wikipedia and several treads here didn't do much since I have only a quite basic knowledge of the "behind the scenes" workings of a computer).
Assume that I for instance have some arrays initialized as:
INTEGER,DIMENSION(64) :: IA
REAL(8),DIMENSION(:,:),ALLOCATABLE :: AA, BB
which after some initialization routines (i.e. read input from file and such) are allocated as (I store some size-integers for easier passing to subroutines in IA of fixed size):
ALLOCATE( AA(N1,N2) , BB(N1,N2) )
IA(1) = N1
IA(2) = N2
This is basically what happens in the initial portion, and so far so good. But when I then call a subroutine
CALL ROUTINE_ONE(AA,BB,IA)
And the routine looks like (nothing fancy):
SUBROUTINE ROUTINE_ONE(AA,BB,IA)
IMPLICIT NONE
INTEGER,DIMENSION(64) :: IA
REAL(8),DIMENSION(IA(1),IA(2)) :: AA, BB
...
do lots of other stuff
...
END SUBROUTINE ROUTINE_ONE
Now I get an error! The output to the screen says:
forrtl: severe (170): Program Exception - stack overflow
However, when I run the program with the debugger it breaks at line 419 in a file called winsig.c (not my file, but probably part of the compiler?). It seems to be part of a routine called sigreterror: and it is the default case that has been invoked, returning the text Invalid signal or error. There is a comment line attached to this which strangely says /* should never happen, but compiler can't tell */ ...?
So I guess my question is, why does this happen and what is actually happening? I thought that as long as I can allocate all the relevant memory I should be fine? Does the call to the subroutine make copies of the arguments, or just pointers to them? If the answer is copies then I can see where the problem might be, and if so: any ideas on how to get around it?
The problem I try to solve is big, but not insane in any way. Standard FE-solvers can handle bigger problems than my current one. I run the program on a Dell PowerEdge 1850 and the OS is Microsoft Server 2008 R2 Enterprise. According to systeminfo at the cmd prompt I have 8GB of physical memory and almost 16GB virtual. As far as I understand the total of all my arrays and matrices should not add up to more than maybe 100MB - about 5.5M integer(4) and 2.5M real(8) (which according to me should be only about 44MB, but let's be fair and add another 50MB for overhead).
I use the Intel Fortran compiler integrated with Microsoft Visual Studio 2008.
Adding some actual source code to clarify a bit
! Update continuum state
CALL UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,&
bmtrx,detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
is the actual call to the routine. Big arrays are posc, bmtrx and aa - all other are at least an order of magnitude smaller (if not more). posc is INTEGER(4) and bmtrx and aa is REAL(8)
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
IMPLICIT NONE
!I/O
INTEGER(4) :: iTask, errmsg
INTEGER(4) :: iArray(64)
INTEGER(4),DIMENSION(iArray(15),iArray(15),iArray(5)) :: posc
INTEGER(4),DIMENSION(iArray(22),iArray(21)+1) :: nodedof
INTEGER(4),DIMENSION(iArray(29),iArray(3)+2) :: elm
REAL(8),DIMENSION(iArray(14)) :: dof, dof_k
REAL(8),DIMENSION(iArray(12)*iArray(17),iArray(15)*iArray(5)) :: bmtrx
REAL(8),DIMENSION(iArray(5)*iArray(17)) :: detjac
REAL(8),DIMENSION(iArray(17)) :: w
REAL(8),DIMENSION(iArray(23),iArray(19)) :: mtrlprops
REAL(8),DIMENSION(iArray(8),iArray(8),iArray(23)) :: demtrx
REAL(8) :: dt
REAL(8),DIMENSION(2,iArray(12)*iArray(17)*iArray(5)) :: stress
REAL(8),DIMENSION(iArray(12)*iArray(17)*iArray(5)) :: strain
REAL(8),DIMENSION(2,iArray(17)*iArray(5)) :: effstrain, effstress
REAL(8),DIMENSION(iArray(25)) :: aa
REAL(8),DIMENSION(iArray(14)) :: fi
!Locals
INTEGER(4) :: i, e, mtrl, i1, i2, j1, j2, k1, k2, dim, planetype, elmnodes, &
Nec, elmpnodes, Ndisp, Nstr, Ncomp, Ngpt, Ndofelm
INTEGER(4),DIMENSION(iArray(15)) :: doflist
REAL(8),DIMENSION(iArray(12)*iArray(17),iArray(15)) :: belm
REAL(8),DIMENSION(iArray(17)) :: jelm
REAL(8),DIMENSION(iArray(12)*iArray(17)*iArray(5)) :: dstrain
REAL(8),DIMENSION(iArray(12)*iArray(17)) :: s
REAL(8),DIMENSION(iArray(17)) :: ep, es, dep
REAL(8),DIMENSION(iArray(15),iArray(15)) :: kelm
REAL(8),DIMENSION(iArray(15)) :: felm
dim = iArray(1)
...
And it fails before the last line above.

As per steabert's request, I'll just summarize the conversation in the comments here where it's a bit more visible, even though M.S.B.'s answer already gets right to the nub of the problem.
In technical programming, where procedures often have large local arrays for intermediate computation, this happens a lot. Local variables are generally stored on the stack, which typically (and quite reasonably) a small fraction of overall system memory -- usually of order 10MB or so. When the local variable sizes exceed the stack size, you see exactly the symptoms described here -- a stack overflow occuring after a call to the relevant subroutine but before its first executable statement.
So when this problem happens, the best thing to do is to find the relevant large local variables, and decide what to do. In this case, at least the variables belm and dstrain were getting quite sizable.
Once the variables are located, and you've confirmed that's the problem, there's a few options. As MSB points out, if you can make your arrays smaller, that's one option. Alternatively, you can make the stack size larger; under linux, that's done with ulimit -s [newsize]. That really just postpones the problem, though, and you have to do something different on windows machines.
The other class of ways to avoid this problem is not to put the large data on the stack, but in the rest of memory (the "heap"). You can do that by giving the arrays the save attribute (in C, static); this puts the variable on the heap and thus makes the values persistent between calls. The downside there is that this potentially changes the behavior of the subroutine, and means the subroutine can't be used recursively, and similarly is non-threadsafe (if you're ever in a position where multiple threads will enter the routine simulatneously, they'll each see the same copy of the local varaiable and potentially overwrite each other's results). The upside is that it's easy and very portable -- it should work everywhere. However, this will only work with fixed-size local variables; if the temporary arrays have sizes that depend on the inputs, you can't do this (since there'd no longer be a single variable to save; it could be different size every time the procedure is called).
There are compiler-specific options which put all arrays (or all arrays of larger than some given size) on the heap rather than on the stack; every Fortran compiler I know has an option for this. For ifort, used in the OPs post, it's -heap-arrays in linux, or /heap-arrays for windows. For gfortran, this may actually be the default. This is good for making sure you know what's going on, but it means you have to have different incantations for every compiler to make sure your code works.
Finally, you can make the offending arrays allocatable. Allocated memory goes on the heap; but the variable which points to them is on the stack, so you get the benefits of both approaches. Also, this is completely standard fortran and so totally portable. The downside is that it requires code changes. Also, the allocation process can take nontrivial amounts of time; so if you're going to be calling the routine zillions of times, you may notice this slows things down slightly. (This possible performance regression is easy to fix, though; if you'll be calling it zillions of times with the same size arrays, you can have an optional argument to pass in a pre-allocated local array and use that instead, so that you only allocate/deallocate once).
Allocating/deallocating each time would look like:
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg)
IMPLICIT NONE
!...arguments....
!Locals
!...
REAL(8),DIMENSION(:,:), allocatable :: belm
REAL(8),DIMENSION(:), allocatable :: dstrain
allocate(belm(iArray(12)*iArray(17),iArray(15))
allocate(dstrain(iArray(12)*iArray(17)*iArray(5))
!... work
deallocate(belm)
deallocate(dstrain)
Note that if the subroutine does a lot of work (eg, takes seconds to execute), the overhead from a couple allocate/deallocates should be negligable. If not, and you want to avoid the overhead, using the optional arguments for preallocated worskpace would look something like:
SUBROUTINE UpdateContinuumState(iTask,iArray,posc,dof,dof_k,nodedof,elm,bmtrx,&
detjac,w,mtrlprops,demtrx,dt,stress,strain,effstrain,&
effstress,aa,fi,errmsg,workbelm,workdstrain)
IMPLICIT NONE
!...arguments....
real(8),dimension(:,:), optional, target :: workbelm
real(8),dimension(:), optional, target :: workdstrain
!Locals
!...
REAL(8),DIMENSION(:,:), pointer :: belm
REAL(8),DIMENSION(:), pointer :: dstrain
if (present(workbelm)) then
belm => workbelm
else
allocate(belm(iArray(12)*iArray(17),iArray(15))
endif
if (present(workdstrain)) then
dstrain => workdstrain
else
allocate(dstrain(iArray(12)*iArray(17)*iArray(5))
endif
!... work
if (.not.(present(workbelm))) deallocate(belm)
if (.not.(present(workdstrain))) deallocate(dstrain)

Not all of the memory is created when the program starts. When you call the subroutine the executable is creating the memory that the subroutine needs for local variables. Typically arrays with simple declarations that are local to that subroutine -- neither allocatable, nor pointer -- are allocated on the stack. You could have simply run of of stack space when you reached these declarations. You might have reached a 2GB limit on a 32-bit OS with some array. Sometimes executable statements implicitly create a temporary array on the stack.
Possible solutions: 1) make your arrays smaller (not attractive), 2) make the stack larger), 3) some compilers have options to switch from placing arrays on the stack to dynamically allocating them, similar to the method used for "allocate", 4) identify large arrays and make them allocatable.

The stack is the memory area where the information needed to return from a function, and the information locally defined in a function is stored. So a stack overflow may indicate you have a function that calls another function which in its turn calls another function, etc.
I am not familiar with Fortran (anymore) but another cause might be that those functions declare tons of local variables, or at least variables that need a lot of place.
A last one: the stack is typically rather small, so it's not a priori relevant how much memory the machine has. It should be quite simple to instruct the linker to increase the stack size, at least if you are certain it's just a lack of space, and not a bug in your application.
Edit: do you use recursion in your program? Recursive calls can eat through the stack very quickly.
Edit: have a look at this: (emphasis mine)
On Windows, the stack space to
reserved for the program is set using
the /Fn compiler option, where n is
the number of bytes. Additionally,
the stack reserve size can be
specified through the Visual Studio
IDE which adds the Microsoft Linker
option /STACK: to the linker command
line. To set this, go to Property
Pages>Configuration
Properties>Linker>System>Stack Reserve
Size. There you can specify the stack
size in bytes in either decimal or
C-language notation. If not specified,
the default stack size is 1MB.

The only problem I ran into with a similar test code, is the 2Gb allocation limit for 32-bit compilation. When I exceed it I get an error message on line 419 in winsig.c
Here is the test code
program FortranCon
implicit none
! Variables
INTEGER :: IA(64), S1
REAL(8), DIMENSION(:,:), ALLOCATABLE :: AA, BB
REAL(4) :: S2
INTEGER, PARAMETER :: N = 10960
IA(1)=N
IA(2)=N
ALLOCATE( AA(N,N), BB(N,N) )
AA(1:N,1:N) = 1D0
BB(1:N,1:N) = 2D0
CALL TEST(AA,BB,IA)
S1 = SIZEOF(AA) !Size of each array
S2 = 2*DBLE(S1)/1024/1024 !Total size for 2 arrays in Mb
WRITE (*,100) S2, ' Mb' ! When allocation reached 2Gb then
100 FORMAT (F8.1,A) ! exception occurs in Win32
DEALLOCATE( AA, BB )
end program FortranCon
SUBROUTINE TEST(AA,BB,IA)
IMPLICIT NONE
INTEGER, DIMENSION(64),INTENT(IN) :: IA
REAL(8), DIMENSION(IA(1),IA(2)),INTENT(INOUT) :: AA,BB
... !Do stuff with AA,BB
END SUBROUTINE
When N=10960 it runs ok showing 1832.9 Mb. With N=11960 it crashes. Of course when I compile with x64 it works ok. Each array has 8*N^2 bytes storage. I don't know if it helps but I recommend using the INTENT() keywords for the dummy variables.

Are you using some parallelization? This can be a problem with statically declared arrays. Try all bigger arrays make ALLOCATABLE, otherwise, they will be placed on the stack in autoparallel or OpenMP threads.

For me the issue was the stack reserve size. I went and changed the stack reserved size from 0 to 100000000 and recompiled the code. The code now runs smoothly.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js