MPI_WTIME is not giving me speedup as required - fortran

Program Main
implicit none
include 'mpif.h'
!Define parameters
integer::my_rank,p2,n2,ierr,source
integer, parameter :: n=3,m=3,o=m*n
real(kind=8) aaa(n),ddd(n),bbb(n),ccc(n),xxx(n),b(m,n),start, finish
integer i, j
real h
real(kind=8),dimension(:),allocatable::sol1
h=0.25
b=0
do i=1,m
b(i,i)=1/(1.2**i)
b(i,i-1)=-b(i,i)
enddo
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,p2,ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank,ierr)
allocate(sol1(o))
start=MPI_WTIME()
do i=1,n
aaa(i)=-1/h**2
bbb(i)=2/h**2+b(my_rank+1,my_rank+1)
ccc(i)=-1/h**2
ddd(i)=1/h**2
enddo
call thomas(aaa,bbb,ccc,ddd,xxx,n)
finish=MPI_WTIME()
print*, finish-start
write(*,*) xxx, my_rank
call MPI_GATHER(xxx,n, MPI_REAL, sol1,n,MPI_REAL8,0, MPI_COMM_WORLD,ierr)
print*,sol1
call MPI_FINALIZE(ierr)
end program main
subroutine thomas(ld,md,ud,rh,solution,n)
implicit none
integer,parameter :: r8 = kind(1.d0)
integer,intent(in) :: n
real(r8),dimension(n),intent(in) :: ld,md,ud,rh
real(r8),dimension(n),intent(out) :: solution
real(r8),dimension(n) :: P,Q
real(r8) :: m
integer i
P(1) = ud(1)/md(1)
Q(1) = rh(1)/md(1)
do i = 2,n
m = md(i)-p(i-1)*ld(i)
P(i) = ud(i)/m
Q(i) = (rh(i)-Q(i-1)*ld(i))/m
end do
solution(n) = Q(n)
do i = n-1, 1, -1
solution(i) = Q(i)-P(i)*solution(i+1)
end do
end subroutine thomas
Here I used MPI_WTIME() to find the execution time. It seems like when I increase the number of processor than I am not getting the speedup. In this code I have m=3 (I make m equal equal to no of processor). I run with mpirun -np 3 sp.exe). Now I change say m=10 and run with mpirun -np 10 sp.exe. I should get the less time, isn't it? or I am missing something here. The community helped me before with some issues and now I am getting another issue. I would really appreciate the help if somebody would point out something.Isn't the chunk of code starting with do loop done by invidual processors( which I want)?

Related

Do loop inside a where block in Fortran

Even though I do not exactly know why, it seems that Fortran (90) does not allow do loops inside where blocks. A code structured as follows does not compile with gfortran (Unexpected DO statement in WHERE block at (1)):
real :: a(30), b(30,10,5)
integer :: i, j
do i=1,10
where(a(:) > 0.)
! do lots of calculations
! modify a
do j=1,5
b(:,i,j)=...
enddo
endwhere
enddo
The only workaround that I can think of would be
real :: a2(30)
do i=1,10
a2(:)=a(:)
where(a(:) > 0.)
! do lots of calculations
! modify a
endwhere
do j=1,5
where(a2(:) > 0.)
b(:,i,j)=...
endwhere
enddo
enddo
I suppose that there are more elegant solutions? Especially if the where condition is less straightforward, this will look messy pretty soon... Thanks!
If your arrays are all 1-indexed, you can replace where constructs by explicitly storing and using array masks, for example
program test
implicit none
real :: a(30), b(30,10,5)
integer :: i, j
integer, allocatable :: mask(:)
do i=1,10
mask = filter(a(:)>0.)
! do lots of calculations
! modify a
do j=1,5
b(mask,i,j)=...
enddo
enddo
contains
! Returns the array of indices `i` for which `input(i)` is `true`.
function filter(input) result(output)
logical, intent(in) :: input(:)
integer, allocatable :: output(:)
integer :: i
output = pack([(i,i=1,size(input))], mask=input)
end function
end program

Fortran low performance with allocatable arrays

I use Intel Visual Fortran, both IVF2013 and IVF2019. When using allocatable arrays, the program is much slower than the one using static memory allocation. That is to say, if I change from
Method 1: by using fixed array
do i = 1, 1000
call A
end do
subroutine A
real(8) :: x(30)
do things
end subroutine A
to something like
Method 2: by using allocatable arrays
module module_size_is_defined
n = 30
end module
do i = 1, 1000
call A
end do
subroutine A
use module_size_is_defined
real(8), allocatable :: x(:)
allocate(x(n))
do things
end subroutine A
The code is much slower. For my code, the static allocation takes 1 minutes 30 seconds while the dynamic allocation takes 2 minutes and 30 seconds. Then, I thought is might because that the allocation action was run takes too much time as it is in the loop, then I tried following two methods:
Method 3: by using the module to allocate the array only once
module module_x_is_allocated
n = 30
allocat(x(n))
end module
do i = 1, 1000
call A
end do
subroutine A
use module_x_is_allocated
do things
end subroutine A
Method 4: by using automatic array
module module_size_is_defined
n = 30
end module
do i = 1, 1000
call A
end do
subroutine A
use module_size_is_defined
real(x) :: x(n)
do things
end subroutine A
Both Method 3 and Method 4 take almost the same time of the one using dynamic allocated array Method 2. Both around 2 mins 30s. All cases are compiling with same optimization. I tried IVF 2013 and IVF 2019, and same results. I don't know why. Especially for Method 3, although the allocate is only run once, it still takes the same time. It seems that dynamic allocated array is stored at the place that is slower than the static allocated array, and allocation does not take extra time (since method 2 and 3 take the same time).
Any ideas and suggestions that to allocate the arrays in a more efficient manner to reduce the performance penalty? Thanks.
!=========================================================================
Edit 1:
My program is too long to post here. Thus, I tried a few small codes. The results are a little bit strange. I tried three cases,
Method 1: takes 28.98s
module module_size_is_defined
implicit none
integer(4) :: n
end module
program main
use module_size_is_defined
implicit none
integer(4) :: i
real(8) :: y(50,50),z(50,50),t
n = 50
do i =1,50000
t=dble(i) * 2.0D0
call A(y,t)
z = z + y
end do
write(*,*) z(1,1)
end
subroutine A(y,t)
use module_size_is_defined
implicit none
real(8),intent(out):: y(n,n)
real(8),intent(in) :: t
integer(4) :: j
real(8) :: x(1,50)
y=0.0D0
do j = 1, 200
call getX(x,t,j)
y = y + matmul( transpose(x) + dble(j)**2, x )
end do
endsubroutine A
subroutine getX(x,t,j)
use module_size_is_defined
implicit none
real(8),intent(out) :: x(1,n)
real(8),intent(in) :: t
integer(4),intent(in) :: j
integer(4) :: i
do i =1, n
x(1,i) = dble(i+j) * t ** (1.5D00)
end do
endsubroutine getX
Method 2: takes 30.56s
module module_size_is_defined
implicit none
integer(4) :: n
end module
program main
use module_size_is_defined
implicit none
integer(4) :: i
real(8) :: y(50,50),z(50,50),t
n = 50
do i =1,50000
t=dble(i) * 2.0D0
call A(y,t)
z = z + y
end do
write(*,*) z(1,1)
end
subroutine A(y,t)
use module_size_is_defined
implicit none
real(8),intent(out):: y(n,n)
real(8),intent(in) :: t
integer(4) :: j
real(8),allocatable :: x(:,:)
allocate(x(1,n))
y=0.0D0
do j = 1, 200
call getX(x,t,j)
y = y + matmul( transpose(x) + dble(j)**2, x )
end do
endsubroutine A
subroutine getX(x,t,j)
use module_size_is_defined
implicit none
real(8),intent(out) :: x(1,n)
real(8),intent(in) :: t
integer(4),intent(in) :: j
integer(4) :: i
do i =1, n
x(1,i) = dble(i+j) * t ** (1.5D00)
end do
endsubroutine getX
Method 3: takes 78.72s
module module_size_is_defined
implicit none
integer(4) :: n
endmodule
module module_array_is_allocated
use module_size_is_defined
implicit none
real(8), allocatable,save :: x(:,:)
contains
subroutine init
implicit none
allocate(x(1,n))
endsubroutine
endmodule module_array_is_allocated
program main
use module_size_is_defined
use module_array_is_allocated
implicit none
integer(4) :: i
real(8) :: y(50,50),z(50,50),t
n = 50
call init
do i =1,50000
t=dble(i) * 2.0D0
call A(y,t)
z = z + y
end do
write(*,*) z(1,1)
end
subroutine A(y,t)
use module_size_is_defined
use module_array_is_allocated
implicit none
real(8),intent(out):: y(n,n)
real(8),intent(in) :: t
integer(4) :: j
y=0.0D0
do j = 1, 200
call getX(x,t,j)
y = y + matmul( transpose(x) + dble(j)**2, x )
end do
endsubroutine A
subroutine getX(x,t,j)
use module_size_is_defined
implicit none
real(8),intent(out) :: x(1,n)
real(8),intent(in) :: t
integer(4),intent(in) :: j
integer(4) :: i
do i =1, n
x(1,i) = dble(i+j) * t ** (1.5D00)
end do
endsubroutine getX
Now, with samller size problem, Method 1 and Method 2 is almost same time. But Method 3 should be better than Method 2, since it only allocate x(1,n) once. But it is much slower. But in my previous program, Method 2 gives almost the same time as Method 3. It is strange.
I complied in both Windows and Linux, with release setup, -O2 Optimization, with different version of IVF.

Fortran character format string as subroutine argument

I am struggling with reading a text string in. Am using gfortran 4.9.2.
Below I have written a little subroutine in which I would like to submit the write format as argument.
Ideally I'd like to be able to call it with
call printarray(mat1, "F8.3")
to print out a matrix mat1 in that format for example. The numbers of columns should be determined automatically inside the subroutine.
subroutine printarray(x, udf_temp)
implicit none
real, dimension(:,:), intent(in) :: x ! array to be printed
integer, dimension(2) :: dims ! array for shape of x
integer :: i, j
character(len=10) :: udf_temp ! user defined format, eg "F8.3, ...
character(len = :), allocatable :: udf ! trimmed udf_temp
character(len = 10) :: udf2
character(len = 10) :: txt1, txt2
integer :: ncols ! no. of columns of array
integer :: udf_temp_length
udf_temp_length = len_trim(udf_temp)
allocate(character(len=udf_temp_length) :: udf)
dims = shape(x)
ncols = dims(2)
write (txt1, '(I5)') ncols
udf2 = trim(txt1)//adjustl(udf)
txt2 = "("//trim(udf2)//")"
do i = 1, dims(1)
write (*, txt2) (x(i, j), j = 1, dims(2)) ! this is line 38
end do
end suroutine printarray
when I set len = 10:
character(len=10) :: udf_temp
I get compile error:
call printarray(mat1, "F8.3")
1
Warning: Character length of actual argument shorter than of dummy argument 'udf_temp' (4/10) at (1)
When I set len = *
character(len=*) :: udf_temp
it compiles but at runtime:
At line 38 of file where2.f95 (unit = 6, file = 'stdout')
Fortran runtime error: Unexpected element '( 8
What am I doing wrong?
Is there a neater way to do this?
Here's a summary of your question that I will try to address: You want to have a subroutine that will print a specified two-dimensional array with a specified format, such that each row is printed on a single line. For example, assume we have the real array:
real, dimension(2,8) :: x
x = reshape([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16], shape=[2,8], order=[2,1])
! Then the array is:
! 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000
! 9.000 10.000 11.000 12.000 13.000 14.000 15.000 16.000
We want to use the format "F8.3", which prints floating point values (reals) with a field width of 8 and 3 decimal places.
Now, you are making a couple of mistakes when creating the format within your subroutine. First, you try to use udf to create the udf2 string. This is a problem because although you have allocated the size of udf, nothing has been assigned to it (pointed out in a comment by #francescalus). Thus, you see the error message you reported: Fortran runtime error: Unexpected element '( 8.
In the following, I make a couple of simplifying changes and demonstrate a few (slightly) different techniques. As shown, I suggest the use of * to indicate that the format can be applied an unlimited number of times, until all elements of the output list have been visited. Of course, explicitly stating the number of times to apply the format (ie, "(8F8.3)" instead of "(*(F8.3))") is fine, but the latter is slightly less work.
program main
implicit none
real, dimension(2,8) :: x
character(len=:), allocatable :: udf_in
x = reshape([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16], shape=[2,8], order=[2,1])
udf_in = "F8.3"
call printarray(x, udf_in)
contains
subroutine printarray(x, udf_in)
implicit none
real, dimension(:,:), intent(in) :: x
character(len=*), intent(in) :: udf_in
integer :: ncols ! size(x,dim=2)
character(len=10) :: ncols_str ! ncols, stringified
integer, dimension(2) :: dims ! shape of x
character(len=:), allocatable :: udf0, udf1 ! format codes
integer :: i, j ! index counters
dims = shape(x) ! or just use: ncols = size(x, dim=2)
ncols = dims(2)
write (ncols_str, '(i0)') ncols ! use 'i0' for min. size
udf0 = "(" // ncols_str // udf_in // ")" ! create string: "(8F8.3)"
udf1 = "(*(" // udf_in // "))" ! create string: "(*(F8.3))"
print *, "Version 1:"
do i = 1, dims(1)
write (*, udf0) (x(i, j), j = 1,ncols) ! implied do-loop over j.
end do
print *, "Version 2:"
do i = 1, dims(1)
! udf1: "(*(F8.3))"
write (*, udf1) (x(i, j), j = 1,ncols) ! implied do-loop over j
end do
print *, "Version 3:"
do i = 1, size(x,dim=1) ! no need to create nrows/ncols vars.
write(*, udf1) x(i,:) ! let the compiler handle the extents.
enddo
end subroutine printarray
end program main
Observe: the final do-loop ("Version 3") is very simple. It does not need an explicit count of ncols because the * takes care of it automatically. Due to its simplicity, there is really no need for a subroutine at all.
besides the actual error (not using the input argument), this whole thing can be done much more simply:
subroutine printarray(m,f)
implicit none
character(len=*)f
real m(:,:)
character*10 n
write(n,'(i0)')size(m(1,:))
write(*,'('//n//f//')')transpose(m)
end subroutine
end
note no need for the loop constructs as fortran will automatically write the whole array , line wrapping as you reach the length of data specified by your format.
alternately you can use a loop construct, then you can use a '*' repeat count in the format and obviate the need for the internal write to construct the format string.
subroutine printarray(m,f)
implicit none
character(len=*)f
real m(:,:)
integer :: i
do i=1,size(m(:,1))
write(*,'(*('//f//'))')m(i,:)
enddo
end subroutine
end

Fortran strange segmentation fault

I have some a problem with my main code, so I tried to isolate the problem.
Therefore, I have this small code :
MODULE Param
IMPLICIT NONE
integer, parameter :: dr = SELECTED_REAL_KIND(15, 307)
integer :: D =3
integer :: Q=10
integer :: mmo=16
integer :: n=2
integer :: x=80
integer :: y=70
integer :: z=20
integer :: tMax=8
END MODULE Param
module m
contains
subroutine compute(f, r)
USE Param, ONLY: dr, mmo, x, y, z, n
IMPLICIT NONE
real (kind=dr), intent(in) :: f(x,y,z, 0:mmo, n)
real (kind=dr), intent(out) :: r(x, y, z, n)
real (kind=dr) :: fGlob(x,y,z, 0:mmo)
!-------------------------------------------------------------------------
print*, 'We are in compute subroutine'
r= 00.0
fGlob=sum(f,dim=5)
r=sum(f, dim=4)
print*, 'fGlob=', fGlob(1,1,1, 1)
print*, 'f=', f(1,1,1, 0,1)
print*, 'r=', r(1,1,1, 1)
end subroutine compute
end module m
PROGRAM test_prog
USE Param
USE m
Implicit None
integer :: tStep
real (kind=dr), dimension(:,:,:, :,:), allocatable :: f
real (kind=dr), dimension(:,:,:,:), allocatable :: r
!----------------------------------------------------------------------------
! Initialise the parameters.
print*, 'beginning of the test'
! Allocate
allocate(f(x,y,z, 0:mmo,n))
allocate(r(x,y,z, n))
f=1.0_dr
! ---------------------------------------------------------
! Iteration over time
! ---------------------------------------------------------
do tStep = 1, tMax
print *, tStep
call compute(f,r)
f=f+1
print *, 'tStep', tStep
enddo
print*, 'f=', f(1,1,1, 0,1)
print*, 'r=', r(1,1,1, 1)
! Deallacation
deallocate(f)
deallocate(r)
print*, 'End of the test program'
END PROGRAM test_prog
For now, I am not able to understand why when I compile with ifort, I have a segmentation fault, and it works when I compile with gfortran. And worst, when I compile with both ifort and gfortran with their fast options, I get again a segmentation fault (core dumped) error. And more confusing, when I also tried with both compilers to compile with traceback options, everything works fine.
I know that segmentation fault (core dumped) error usually means that I try to read or write in a wrong location (matrix indices etc...); but here with this small code, I see no mistake like this.
Does anyone can help me to understand why theses errors occur?
The problem comes from the size of the stack used by some compilers by default (ifort) or by some others when they optimise the compilation (gfortran -Ofast). Here, our writings exceed the size of the stack.
To solve this, I use the options -heap-arrays for ifort compiler and -fno-stack-arrays for gfortran compiler.

Using mpi_scatterv with 4D fortran array

I'm trying to break up a 4D array over the third dimension, and send to each node using MPI. Basically, I'm computing derivatives of a matrix, Cpq, with respect to atom positions in each of the three cartesian directions. Cpq is of size nat_sl x nat_sl, so dCpqdR is of size nat_sl x nat_sl x nat x 3. At the end of the day, for ever s,i pair, I have to compute the matrix product of dCpqdR between the transpose of the eigenvectors of Cpq and the eigenvectors of Cpq like so:
temp = MATMUL(TRANSPOSE(Cpq), MATMUL(dCpqdR(:, :, s, i), Cpq))
This is fine, but as it turns out, the loop over s and i is now by far the slow part of my code. Because each can be done independently, I was hoping that I could break up dCpqdR, and give each task it's own s, i to compute the derivative of. That is, I'd like task 1 to get dCpqdR(:,:,1,1), task 2 to get dCpqdR(:,:,1,2), etc.
I've got this working in some sense by using a buffered send/recv pair of calls. The root node allocates a temporary array, fills it, sends to the relevant nodes, and the relevant nodes do their computations as they wish. This is fine, but can be slow and memory inefficient. I'd ideally like to break it up in a more memory efficient way.
The logical thing to do, then, is to use mpi_scatterv, but here is where I start running into trouble, as I'm having trouble figuring out the memory layout for this. I've written this, so far:
call mpi_type_create_subarray(4, (/ nat_sl, nat_sl, nat, 3 /), (/nat_sl, nat_sl, n_pairs(me_image+1), 3/),&
(/0, 0, 0, 0/), mpi_order_fortran, mpi_double_precision, subarr_typ, ierr)
call mpi_type_commit(subarr_typ, ierr)
call mpi_scatterv(dCpqdR, n_pairs(me_image+1), f_displs, subarr_typ,&
my_dCpqdR, 3*nat_sl*3*nat_sl*3*n_pairs(me_image+1), subarr_typ,&
root_image, intra_image_comm, ierr)
I've computed n_pairs using this subroutine:
subroutine mbdvdw_para_init_int_forces()
implicit none
integer :: p, s, i, counter, k, cpu_ind
integer :: num_unique_rpq, n_pairs_per_proc, cpu
real(dp) :: Rpq(3), Rpq_norm, current_val
num_pairs = nat
if(.not.allocated(f_cpu_id)) allocate(f_cpu_id(nat, 3))
n_pairs_per_proc = floor(dble(num_pairs)/nproc_image)
cpu = 0
n_pairs = 0
counter = 1
p = 1
do counter = 0, num_pairs-1, 1
n_pairs(modulo(counter, nproc_image)+1) = n_pairs(modulo(counter, nproc_image)+1) + 1
end do
do s = 1, nat, 1
f_cpu_id(s) = cpu
if((counter.lt.num_pairs)) then
if(p.eq.n_pairs(cpu+1)) then
cpu = cpu + 1
p = 0
end if
end if
p = p + 1
end do
call mp_set_displs( n_pairs, f_displs, num_pairs, nproc_image)
f_displs = f_displs*nat_sl*nat_sl*3
end subroutine mbdvdw_para_init_int_forces
and the full method for the matrix multiplication is
subroutine mbdvdw_interacting_energy(energy, forcedR, forcedh, forcedV)
implicit none
real(dp), intent(out) :: energy
real(dp), dimension(nat, 3), intent(out) :: forcedR
real(dp), dimension(3,3), intent(out) :: forcedh
real(dp), dimension(nat), intent(out) :: forcedV
real(dp), dimension(3*nat_sl, 3*nat_sl) :: temp
real(dp), dimension(:,:,:,:), allocatable :: my_dCpqdR
integer :: num_negative, i_atom, s, i, j, counter
integer, parameter :: eigs_check = 200
integer :: subarr_typ, ierr
! lapack work variables
integer :: LWORK, errorflag
real(dp) :: WORK((3*nat_sl)*(3+(3*nat_sl)/2)), eigenvalues(3*nat_sl)
call start_clock('mbd_int_energy')
call mp_sum(Cpq, intra_image_comm)
eigenvalues = 0.0_DP
forcedR = 0.0_DP
energy = 0.0_DP
num_negative = 0
forcedV = 0.0_DP
errorflag=0
LWORK=3*nat_sl*(3+(3*nat_sl)/2)
call DSYEV('V', 'U', 3*nat_sl, Cpq, 3*nat_sl, eigenvalues, WORK, LWORK, errorflag)
if(errorflag.eq.0) then
do i_atom=1, 3*nat_sl, 1
!open (unit=eigs_check, file="eigs.tmp",action="write",status="unknown",position="append")
! write(eigs_check, *) eigenvalues(i_atom)
!close(eigs_check)
if(eigenvalues(i_atom).ge.0.0_DP) then
energy = energy + dsqrt(eigenvalues(i_atom))
else
num_negative = num_negative + 1
end if
end do
if(num_negative.ge.1) then
write(stdout, '(3X," WARNING: Found ", I3, " Negative Eigenvalues.")'), num_negative
end if
else
end if
energy = energy*nat/nat_sl
!!!!!!!!!!!!!!!!!!!!
! Forces below here. There's going to be some long parallelization business.
!!!!!!!!!!!!!!!!!!!!
call start_clock('mbd_int_forces')
if(.not.allocated(my_dCpqdR)) allocate(my_dCpqdR(nat_sl, nat_sl, n_pairs(me_image+1), 3)), my_dCpqdR = 0.0_DP
if(mbd_vdw_forces) then
do s=1,nat,1
if(me_image.eq.(f_cpu_id(s)+1)) then
do i=1,3,1
temp = MATMUL(TRANSPOSE(Cpq), MATMUL(my_dCpqdR(:, :, counter, i), Cpq))
do j=1,3*nat_sl,1
if(eigenvalues(j).ge.0.0_DP) then
forcedR(s, i) = forcedR(s, i) + 1.0_DP/(2.0_DP*dsqrt(eigenvalues(j)))*temp(j,j)
end if
end do
end do
counter = counter + 1
end if
end do
forcedR = forcedR*nat/nat_sl
do s=1,3,1
do i=1,3,1
temp = MATMUL(TRANSPOSE(Cpq), MATMUL(dCpqdh(:, :, s, i), Cpq))
do j=1,3*nat_sl,1
if(eigenvalues(j).ge.0.0_DP) then
forcedh(s, i) = forcedh(s, i) + 1.0_DP/(2.0_DP*dsqrt(eigenvalues(j)))*temp(j,j)
end if
end do
end do
end do
forcedh = forcedh*nat/nat_sl
call mp_sum(forcedR, intra_image_comm)
call mp_sum(forcedh, intra_image_comm)
end if
call stop_clock('mbd_int_forces')
call stop_clock('mbd_int_energy')
return
end subroutine mbdvdw_interacting_energy
But when run, it's complaining that
[MathBook Pro:58100] *** An error occurred in MPI_Type_create_subarray
[MathBook Pro:58100] *** reported by process [2560884737,2314885530279477248]
[MathBook Pro:58100] *** on communicator MPI_COMM_WORLD
[MathBook Pro:58100] *** MPI_ERR_ARG: invalid argument of some other kind
[MathBook Pro:58100] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[MathBook Pro:58100] *** and potentially your MPI job)
so something is going wrong, but I have no idea what. I know my description is somewhat sparse to start with, so please let me know what information would be necessary to help.