Abaqus Package_dp terminated by signal 11 - fortran

I'm trying to run a VUEL subroutine in Abaqus 2022, and I have been encountering a memory allocation error (signal 11). The error is caused by a single tensor, summarized below. Its total size is not near the Abaqus tensor size limit, yet I still get a memory allocation error.
Exact error text:
*** ABAQUS/package_dp rank 0 terminated by signal 11
*** ERROR CATEGORY: PACKAGE
Abaqus Error: Abaqus/Explicit Packager exited with an error - Please see the
status file for possible error messages if the file exists.
Abaqus/Analysis exited with errors
Code:
...
integer, parameter :: maxelno = 14000
double precision, dimension(maxelno, 8, 30) :: uel_isvs
...
If I reduce maxelno to 4200, the subroutine runs. Likewise, if I instead use a one-dimensional dummy variable, double precision, dimension(maxelno) :: dummy, I get the same error as above.
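For reference, here is a back-of-the-envelope check of the tensor's footprint (a minimal, stand-alone sketch, not part of the subroutine; it assumes the default 8-byte double precision):
program footprint
  implicit none
  ! Illustrative only: compute the size of the declared tensor in bytes,
  ! assuming 8-byte double precision elements.
  integer, parameter :: maxelno = 14000
  double precision, dimension(maxelno, 8, 30) :: uel_isvs
  ! 14000 * 8 * 30 elements * 8 bytes = 26,880,000 bytes (~25.6 MiB);
  ! with maxelno = 4200 it would be about 8,064,000 bytes (~7.7 MiB).
  print *, 'uel_isvs bytes:', size(uel_isvs) * (storage_size(uel_isvs) / 8)
end program footprint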
I am running on a Linux cluster with 700 GB of RAM and 42 cores, so there shouldn't be any issue with memory limits either. Any help would be appreciated.

Related

SIGBUS occurs when Fortran code reads a file on a Linux cluster

When I run my Fortran code in parallel on a Linux cluster with mpirun, I get a SIGBUS error.
It occurs while reading a file; the timing is irregular, and sometimes the run finishes without any error.
I have tried debug compilation options such as -g, but I haven't been able to get any information about which line the error comes from.
The code has previously run on three different clusters without this error; it only occurs on this machine.
I personally suspect this is related to the performance of the machine (especially storage I/O), but I am not sure.
The program code is simple. Each process executed by mpirun reads the file corresponding to its rank as follows.
!!!!!!!!!! start of code
OPEN(11, FILE='FILE_NAME_WITH_RANK', FORM='UNFORMATTED')
READ(11) ISIZE
ALLOCATE(SOME_VARIABLE(ISIZE))
DO I = 1, ISIZE
  READ(11) SOME_VARIABLE(I)   ! one unformatted record per element
ENDDO
READ(11) ISIZE2
ALLOCATE(SOME_VARIABLE2(ISIZE2))
DO I = 1, ISIZE2
  READ(11) SOME_VARIABLE2(I)
ENDDO
! MORE VARIABLES
CLOSE(11)
!!!!!!!!!! end of code
I used 191 CPUs, and the total size of the 191 files they read is about 11 GB.
The cluster consists of 24 nodes with 16 CPUs each (384 CPUs in total) and uses common storage that is also shared with another cluster.
I ran the code in parallel by listing nodes 1 through 12 in the hostfile.
Initially, I had all 191 CPUs read their files at the same time, in no particular order.
With that approach, the program ended with a SIGBUS error. In addition, on some nodes the SSH connection was delayed, and the nodes could not find the .bashrc file, reporting stale file handle errors.
After waiting a while, the stale file handle errors seemed to recover on their own, but I am not sure whether the system administrator did something.
So I changed the code as follows, so that only one CPU reads its file at a time.
!!!!!!!!!! start of code
DO ICPU = 0, NUMBER_OF_PROCESS-1
  IF (ICPU.EQ.MY_PROCESS) CALL READ_FILE   ! ranks take turns reading
  CALL MPI_BARRIER(MPI_COMMUNICATOR, IERR)
ENDDO
!!!!!!!!!! end of code
This seemed to work fine for a single run, but if I launched more than one of these programs at the same time, the first mpirun stalled and eventually both runs ended with a SIGBUS error.
My next idea is to minimize the number of READ statements by removing the DO loops and reading each array with a single statement (see the sketch below). However, due to limited time, I haven't been able to test how effective this modification is.
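To make that concrete, this is roughly what the per-rank reader would look like (only a sketch; it assumes the writer is also changed so that each array is written as a single unformatted record, e.g. WRITE(11) SOME_VARIABLE, otherwise the record lengths will not match):
!!!!!!!!!! start of code (sketch)
OPEN(11, FILE='FILE_NAME_WITH_RANK', FORM='UNFORMATTED')
READ(11) ISIZE
ALLOCATE(SOME_VARIABLE(ISIZE))
READ(11) SOME_VARIABLE        ! whole array in one READ
READ(11) ISIZE2
ALLOCATE(SOME_VARIABLE2(ISIZE2))
READ(11) SOME_VARIABLE2
! MORE VARIABLES
CLOSE(11)
!!!!!!!!!! end of code (sketch)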
Here is some additional information.
If I run a search or copy a file with a file manager such as Nautilus while the parallel program is running, Nautilus stops responding or the running program raises SIGBUS. In severe cases, I could not connect to the VNC server because of stale file handle errors.
I use OpenMPI 2.1.1 and GNU Fortran 4.9.4.
I compile the program with the following:
$OPENMPIHOME/bin/mpif90 -mcmodel=large -fmax-stack-var-size=64 -cpp -O3 $SOURCE -o $EXE
I execute the program with the following in a GNOME terminal:
$OPENMPIHOME/bin/mpirun -np $NP -x $LD_LIBRARY_PATH --hostfile $HOSTFILE $EXE
The cluster is said to be running commercial software like FLUENT without problems.
Summing up, my personal suspicion is that the cluster's storage gets unmounted because of the excessive disk I/O generated by my code, but I don't know whether this makes sense, since I have little knowledge of clusters.
If so, I wonder whether there is a way to minimize the disk I/O, whether the vectorized I/O mentioned above would be enough, or whether something more is needed.
I would appreciate it if you could tell me anything about the problem.
Thanks in advance.
!!!
I wrote example code. As mentioned above, it may not be easy to reproduce, because the occurrence depends on the machine.
PROGRAM BUSWRITE
  IMPLICIT NONE
  INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
  DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
  INTEGER :: I
  INTEGER :: I1, I2, I3
  CHARACTER*3 CPUNUM
  INCLUDE 'mpif.h'
  INTEGER ISTATUS(MPI_STATUS_SIZE)
  INTEGER :: IERR, NPES, MYPE

  CALL MPI_INIT(IERR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPES, IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYPE, IERR)

  ! Build a zero-padded rank string, e.g. rank 7 -> '007'
  I1 = MOD(MYPE/100, 10) + 48
  I2 = MOD(MYPE/10,  10) + 48
  I3 = MOD(MYPE,     10) + 48
  CPUNUM = CHAR(I1)//CHAR(I2)//CHAR(I3)

  OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
  ALLOCATE(ARRAY1(ISIZE1))
  ALLOCATE(ARRAY2(ISIZE2))
  ALLOCATE(ARRAY3(ISIZE3))
  ! Write one unformatted record per element
  DO I = 1, ISIZE1
    ARRAY1(I) = I
    WRITE(11) ARRAY1(I)
  ENDDO
  DO I = 1, ISIZE2
    ARRAY2(I) = I**2
    WRITE(11) ARRAY2(I)
  ENDDO
  DO I = 1, ISIZE3
    ARRAY3(I) = DBLE(I)**3   ! DBLE avoids 32-bit integer overflow for large I
    WRITE(11) ARRAY3(I)
  ENDDO
  CLOSE(11)
  CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./buswrite.f90 -o ./buswrite
mpirun -np 32 ./buswrite
This produced 32 files, 000.DAT through 031.DAT.
PROGRAM BUSREAD
  IMPLICIT NONE
  INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
  DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
  INTEGER :: I
  INTEGER :: I1, I2, I3
  CHARACTER*3 CPUNUM
  INCLUDE 'mpif.h'
  INTEGER ISTATUS(MPI_STATUS_SIZE)
  INTEGER :: IERR, NPES, MYPE

  CALL MPI_INIT(IERR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPES, IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYPE, IERR)

  ! Build a zero-padded rank string, e.g. rank 7 -> '007'
  I1 = MOD(MYPE/100, 10) + 48
  I2 = MOD(MYPE/10,  10) + 48
  I3 = MOD(MYPE,     10) + 48
  CPUNUM = CHAR(I1)//CHAR(I2)//CHAR(I3)

  OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
  ALLOCATE(ARRAY1(ISIZE1))
  ALLOCATE(ARRAY2(ISIZE2))
  ALLOCATE(ARRAY3(ISIZE3))
  ! Read back one record per element and verify the contents
  DO I = 1, ISIZE1
    READ(11) ARRAY1(I)
    IF (ARRAY1(I).NE.I) STOP
  ENDDO
  DO I = 1, ISIZE2
    READ(11) ARRAY2(I)
    IF (ARRAY2(I).NE.I**2) STOP
  ENDDO
  DO I = 1, ISIZE3
    READ(11) ARRAY3(I)
    IF (ARRAY3(I).NE.DBLE(I)**3) STOP   ! DBLE avoids 32-bit integer overflow
  ENDDO
  CLOSE(11)
  CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
  IF (MYPE.EQ.0) WRITE(*,*) 'GOOD'
  CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./busread.f90 -o ./busread
mpirun -np 32 ./busread
I get the expected 'GOOD' output in the terminal, but the machine in question terminates with a SIGBUS error while running busread.
The issue was no longer observed after the machine was rebooted. Even when I ran four of these programs at the same time under the same conditions, no problem occurred. In addition, other teams that used the machine had similar problems, which were also resolved by the reboot. The conclusion is a bit anticlimactic, but in case anyone experiences similar problems, I will summarize it as follows.
If your program terminates abnormally with a memory error (such as SIGBUS or SIGSEGV) while reading or writing a file, you can check the following:
1. Make sure there are no errors in your program: check whether the error occurs at a constant or an irregular point, whether other programs show the same symptoms, whether the code runs well on other machines, and whether a memory checking tool such as Valgrind reports any problems.
2. Optimize the file I/O. In Fortran, reading or writing an entire array in one statement is tens of times faster than processing it element by element.
3. Immediately after the error occurs, try an SSH connection to the machine (or node) to check whether the connection is responsive and the file system is accessible. If you cannot access your .bashrc, or you get errors such as stale file handle, contact the system administrator together with the information gathered above.
If someone has anything to add or if this post isn't appropriate, please let me know.

How to fix RuntimeError & Segmentation fault while running demo codes in auto-07p

I was running the demo codes in AUTO-07p, which is mainly based on Fortran but is driven through Python syntax, and I kept getting the same error with most of the demos. Since they were not written by me but by experts, and have been released for a long time, there should not be an error in them; presumably I don't need to modify the code.
Whenever I run a demo containing either a bifurcation-analysis part or plotting, I get the error message RuntimeError: maximum recursion depth exceeded while calling a Python object.
So I tried raising the recursion limit to 10^7, but then I got the error "Segmentation fault (core dumped)".
I'm using Ubuntu 16.04 LTS.
demo('bvp')
bvp = run('bvp')
branchpoints = bvp("BP")
for solution in branchpoints:
    bp = load(solution, ISW=-1, NTST=50)
    # Compute forwards
    print "Solution label", bp["LAB"], "forwards"
    fw = run(bp)
    # Compute backwards
    print "Solution label", bp["LAB"], "backwards"
    bw = run(bp, DS='-')
    both = fw + bw
    merged = merge(both)
    bvp = bvp + merged
bvp = relabel(bvp)
save(bvp, 'bvp')
plot(bvp)
wait()
The error message says (excluding seemingly unnecessary lines):
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in flatnonzero(a)
924
925
-->926 return np.nonzero(np.ravel(a))[0]
927
928
... last 1 frames repeated, from the frame below ...
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in flatnonzero(a)
924
925
-->926 return np.nonzero(np.ravel(a))[0]
927
928
RuntimeError: maximum recursion depth exceeded while calling a Python object
So it seems the code is stuck in an infinite recursion, but that flatnonzero function simply indexes the nonzero entries... I'm not sure what I should do.

SAS Management Console, Exited with exit code 116, ERROR: Unrecognized SAS option name

I'm using SAS Management Console to schedule a job. The job ends almost immediately, and the error I receive by email is:
Exited with exit code 116.
Resource usage summary:
CPU time : 0.28 sec.
Max Memory : 14 MB
Total Requested Memory : -
Delta Memory : -
(Delta: the difference between total requested memory and actual max usage.)
The output (if any) follows:
ERROR: Unrecognized SAS option name LOGU+00A0/FILEPATH/LOGNAME.LOG.
ERROR: (SASXKRIN): KERNEL RESOURCE INITIALIZATION FAILED.
ERROR: Unable to initialize the SAS kernel.
Any ideas what the issue could be?
Thanks in advance!

memory error with EigenAllocator

When I train my model, there is no error. But when the code switches from one epoch to the next, it reports the following error:
E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
F tensorflow/core/common_runtime/gpu/gpu_device.cc:104] EigenAllocator for GPU ran out of memory when allocating 0. See error logs for more detailed info.
How can I solve this error?

How to find the origin of MPI message truncated errors?

I am currently having problems with an MPI application.
I am sporadically receiving MPI errors of the form:
Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(1339)...............: MPI_Allreduce(sbuf=0x7ffa87ffcb98, rbuf=0x7ffa87ffcba8, count=2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(1180).........:
MPIR_Allreduce_intra(755).........:
MPIDI_CH3U_Receive_data_found(129): Message from rank 0 and tag 14 truncated; 384 bytes received but buffer size is 16
rank 1 in job 1 l1442_42561 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
However, I do not know where to look. I know that the error happens in an MPI_Allreduce call, but there are multiple such calls.
How do I find out which call produces the error? Simple printf debugging does not help, as the function may be called a million times before the error first occurs.
The error might also not occur at all, or it might occur immediately after the program starts.
I have been able to track down the origin of the error by calling
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN)
and then checking the return value of each Allreduce call against MPI_SUCCESS. This pinpoints a location where the error occurs.
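For reference, a minimal Fortran sketch of that pattern (the application itself may well be in C, and the buffer sizes, names, and message text here are only illustrative): the communicator's error handler is switched to MPI_ERRORS_RETURN so that collectives return an error code instead of aborting all ranks, and each Allreduce call site is tagged and checked.
PROGRAM CHECK_ALLREDUCE
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  DOUBLE PRECISION :: SBUF(2), RBUF(2)
  INTEGER :: IERR, IERR2, MYPE

  CALL MPI_INIT(IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYPE, IERR)

  ! Return error codes instead of aborting all ranks on failure.
  CALL MPI_ERRHANDLER_SET(MPI_COMM_WORLD, MPI_ERRORS_RETURN, IERR)

  SBUF = DBLE(MYPE)
  CALL MPI_ALLREDUCE(SBUF, RBUF, 2, MPI_DOUBLE_PRECISION, MPI_SUM, &
                     MPI_COMM_WORLD, IERR2)
  ! Tag each call site so the failing one can be identified in the output.
  IF (IERR2 .NE. MPI_SUCCESS) THEN
    WRITE(*,*) 'Allreduce at call site 1 failed on rank', MYPE, ', ierr =', IERR2
  END IF

  CALL MPI_FINALIZE(IERR)
END PROGRAM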