More processors requested than permitted - Fortran

I parallelized three nested loops with MPI. When I ran the code, an error popped up: 'srun: error: Unable to create step for job 20258899: More processors requested than permitted'.
Here is the script that I used to submit the job.
#!/bin/bash
#SBATCH --partition=workq
#SBATCH --job-name="code"
#SBATCH --nodes=2
#SBATCH --time=1:00:00
#SBATCH --exclusive
#SBATCH --err=std.err
#SBATCH --output=std.out
#---#
module switch PrgEnv-cray PrgEnv-intel
export OMP_NUM_THREADS=1
#---#
echo "The job "${SLURM_JOB_ID}" is running on "${SLURM_JOB_NODELIST}
#---#
srun --ntasks=1000 --cpus-per-task=${OMP_NUM_THREADS} --hint=nomultithread ./example_parallel
My code is pasted below. Could anyone tell me what the problem is? Is my use of MPI wrong? Thank you very much.
PROGRAM THREEDIMENSION
USE MPI
IMPLICIT NONE
INTEGER, PARAMETER :: dp = SELECTED_REAL_KIND(p=15,r=14)
INTEGER :: i, j, k, le(3)
REAL (KIND=dp), ALLOCATABLE :: kp(:,:,:,:), kpt(:,:), col1(:), col2(:)
REAL (KIND=dp) :: su, co, tot
INTEGER :: world_size, world_rank, ierr
INTEGER :: world_comm_1st, world_comm_2nd, world_comm_3rd
INTEGER :: th3_dimension_size, th3_dimension_size_max, th3_dimension_rank
INTEGER :: th2_dimension_size, th2_dimension_size_max, th2_dimension_rank
INTEGER :: th1_dimension_size, th1_dimension_size_max, th1_dimension_rank
INTEGER :: proc_1st_dimension_len, proc_2nd_dimension_len, proc_3rd_last_len, proc_i, proc_j, proc_k
REAL (KIND=dp) :: t0, t1
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, world_size, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, world_rank, ierr)
IF (world_rank == 0) THEN
t0 = MPI_WTIME()
END IF
le(1) = 1000
le(2) = 600
le(3) = 900
ALLOCATE (kp(le(1),le(2),le(3),3))
ALLOCATE (kpt(le(3),3))
ALLOCATE (col1(le(1)))
ALLOCATE (col2(le(2)))
DO i = 1, le(1), 1
DO j = 1, le(2), 1
DO k = 1, le(3), 1
kp(i,j,k,1) = DBLE(i+j+j+1)
kp(i,j,k,2) = DBLE(i+j+k+2)
kp(i,j,k,3) = DBLE(i+j+k+3)
END DO
END DO
END DO
proc_1st_dimension_len = (world_size - 1) / le(1) + 1
proc_2nd_dimension_len = (world_size - 1 / (le(1) + le(2))) + 1
proc_3rd_last_len = MOD(world_size - 1, le(1)+le(2)) + 1
IF (world_rank <= proc_3rd_last_len*proc_2nd_dimension_len*proc_1st_dimension_len) THEN
proc_i = MOD(world_rank,proc_1st_dimension_len)
proc_j = world_rank / proc_1st_dimension_len
proc_k = world_rank / (proc_1st_dimension_len*proc_2nd_dimension_len)
ELSE
proc_i = MOD(world_rank-proc_3rd_last_len,proc_1st_dimension_len-1)
proc_j = (world_rank-proc_3rd_last_len) / proc_1st_dimension_len-1
proc_k = (world_rank-proc_3rd_last_len) / (proc_2nd_dimension_len*proc_2nd_dimension_len-1)
END IF
CALL MPI_BARRIER(MPI_COMM_WORLD,ierr)
CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,proc_i,world_rank,world_comm_1st,ierr)
CALL MPI_COMM_SIZE(world_comm_1st,th1_dimension_size,ierr)
CALL MPI_COMM_RANK(world_comm_1st,th1_dimension_rank,ierr)
CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,proc_j,world_rank,world_comm_2nd,ierr)
CALL MPI_COMM_SIZE(world_comm_2nd,th2_dimension_size,ierr)
CALL MPI_COMM_RANK(world_comm_2nd,th2_dimension_rank,ierr)
CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,proc_k,world_rank,world_comm_3rd,ierr)
CALL MPI_COMM_SIZE(world_comm_3rd,th3_dimension_size,ierr)
CALL MPI_COMM_RANK(world_comm_3rd,th3_dimension_rank,ierr)
CALL MPI_BARRIER(MPI_COMM_WORLD,ierr)
CALL MPI_ALLREDUCE(th1_dimension_size,th1_dimension_size_max,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD,ierr)
CALL MPI_ALLREDUCE(th2_dimension_size,th2_dimension_size_max,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD,ierr)
IF (world_rank == 0) THEN
OPEN (UNIT=3, FILE='out.dat', STATUS='UNKNOWN')
END IF
col1 = 0.0
DO i = 1, le(1), 1
IF (MOD(i-1,th1_dimension_size_max) /= th1_dimension_rank) CYCLE
col2 = 0.0
DO j = 1, le(2), 1
IF (MOD(j-1,th2_dimension_size_max) /= th2_dimension_rank) CYCLE
kpt = kp(i,j,:,:)
su = 0.0
DO k = 1, le(3), 1
IF(MOD(k-1,th1_dimension_size*th2_dimension_size) /= th3_dimension_rank) CYCLE
CALL CAL(kpt(k,3),co)
su = su + co
END DO
CALL MPI_BARRIER(world_comm_3rd,ierr)
CALL MPI_REDUCE(su,col2(j),1,MPI_DOUBLE,MPI_SUM,0,world_comm_3rd,ierr)
END DO
CALL MPI_BARRIER(world_comm_2nd,ierr)
CALL MPI_REDUCE(col2,col1(i),le(2),MPI_DOUBLE,MPI_SUM,0,world_comm_2nd,ierr)
END DO
CALL MPI_BARRIER(world_comm_1st,ierr)
tot = 0.0
IF (th1_dimension_rank == 0) THEN
CALL MPI_REDUCE(col1,tot,le(1),MPI_DOUBLE,MPI_SUM,0,world_comm_1st,ierr)
WRITE (UNIT=3, FMT=*) tot
CLOSE (UNIT=3)
END IF
DEALLOCATE (kp)
DEALLOCATE (kpt)
DEALLOCATE (col1)
DEALLOCATE (col2)
IF (world_rank == 0) THEN
t1 = MPI_WTIME()
WRITE (UNIT=3, FMT=*) 'Total time:', t1 - t0, 'seconds'
END IF
CALL MPI_FINALIZE (ierr)
STOP
END PROGRAM THREEDIMENSION
SUBROUTINE CAL(arr,co)
IMPLICIT NONE
INTEGER, PARAMETER :: dp=SELECTED_REAL_KIND(p=15,r=14)
INTEGER :: i
REAL (KIND=dp) :: arr(3), co
co = 0.0d0
co = co + (arr(1) ** 2 + arr(2) * 3.1d1) / (arr(3) + 5.0d-1)
RETURN
END SUBROUTINE CAL

With the #SBATCH directives in the header of the file, you request two nodes explicitly, and, as you do not specify --ntasks, you get the default of one task per node, so you implicitly request two tasks.
Then, when the job starts, your srun line tries to "use" 1000 tasks. As suggested by @Gilles, you should have the line
#SBATCH --ntasks=1000
in the header. The srun command will then inherit those 1000 tasks by default, so there is no need to specify the count again on the srun line in this case.
Also, if ${OMP_NUM_THREADS} were not 1, you would have to specify --cpus-per-task in the header as an #SBATCH directive as well, otherwise you would face the same error.
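For reference, here is a minimal sketch of the corrected script under that assumption. The explicit --cpus-per-task=1 line is optional while OMP_NUM_THREADS is 1, and whether 1000 tasks actually fit on the nodes you request still depends on your cluster's hardware:
#!/bin/bash
#SBATCH --partition=workq
#SBATCH --job-name="code"
#SBATCH --nodes=2
#SBATCH --ntasks=1000
# optional while OMP_NUM_THREADS=1, required as soon as you use more threads per task
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --exclusive
#SBATCH --err=std.err
#SBATCH --output=std.out
#---#
module switch PrgEnv-cray PrgEnv-intel
export OMP_NUM_THREADS=1
#---#
echo "The job "${SLURM_JOB_ID}" is running on "${SLURM_JOB_NODELIST}
#---#
srun --hint=nomultithread ./example_parallel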

Related

gathering the rank's domain to the rank master to write results

Let's say you have decomposed your domain according to the number of ranks (with mpi_cart_create), so each rank knows the indices of its own subdomain. I then want to use gatherv to reassemble a global array in order to write the results. This code seems to work for 2 or 4 processes but not for 8. I feel like I am missing something; no error message is produced.
Edit: each rank holds its own local array, whose elements are all equal to the rank of the process (0 to n_procs-1). What I am trying to achieve is to rebuild the global array with gatherv. This is for a finite-volume code; my aim here is just to reproduce the domain decomposition and the gatherv behaviour I need. In the printed output there should be no -10 left, and the blocks should correspond clearly to the decomposition done by MPI.
The code works as follows: I decompose the domain, compute the delimiting indices associated with each subdomain, define the new type that will be received into the global array, then compute the displacements and call gatherv.
program scatter
use mpi
implicit none
integer,parameter :: nrows = 8,ncols = 8
integer, dimension(8,8) :: global
integer, dimension(2) :: dims_orig,dims,coo,sizes,subsizes,coo_init
logical,dimension(2) :: periods
integer, dimension(:,:),allocatable :: local
integer, dimension(:),allocatable :: displ,counts
integer :: info,code,rank,n_procs,comm2d
integer :: ndi,ndj,nni,nnj,nni_last,nnj_last
integer :: i1,i2,j1,j2,ii1,ii2,jj1,jj2,row,col
integer :: type0,type_subarray,sizeofint,i,j
integer :: imax,jmax,loc_size
integer(kind=mpi_address_kind) :: start,extend
logical :: reorder
! dimensions of the global array
imax = nrows
jmax = ncols
! default value of the global array
global = -10
call mpi_init(code)
call mpi_comm_rank(mpi_comm_world,rank,code)
call mpi_comm_size(mpi_comm_world,n_procs,code)
dims_orig = 0
CALL MPI_DIMS_CREATE(n_procs ,2,dims_orig,code)
! number of domain in each direction
ndi = dims_orig(1)
ndj = dims_orig(2)
! nni/nnj = sizes of each domain
nni = nrows/ndi
nnj = ncols/ndj
nni_last = nrows-(ndi-1)*nrows/ndi
nnj_last = ncols-(ndj-1)*ncols/ndj
dims(1) = ndi
dims(2) = ndj
periods = .false.
reorder = .true.
call mpi_cart_create(MPI_COMM_WORLD,2,dims,periods,reorder,comm2d,code)
call mpi_comm_rank(comm2d,rank,code)
call mpi_cart_get(comm2d,2,dims,periods,coo,code)
if(coo(1)==ndi-1) then
nni = nni_last
i2 = imax
i1 = i2-nni+1
else
i1 = rank/ndj*nni+1
i2 = i1+nni-1
endif
if(coo(2)==ndj-1) then
nnj = nnj_last
j2 = jmax
j1 = j2-nnj+1
else
j1 = MOD(rank,ndj)*nnj+1
j2 = j1+nnj-1
endif
print*,rank,"|",i1,i2,j1,j2
call mpi_barrier(mpi_comm_world,code)
! create new types
sizes = [imax, jmax]
subsizes = [i2-i1+1,j2-j1+1]
coo_init = [i1-1 ,j1-1 ]
call mpi_type_create_subarray(2,sizes,subsizes,coo_init,&
mpi_order_fortran,MPI_integer,type0,code)
call mpi_type_size(MPI_INTEGER,sizeofint,code)
start = 0
extend = sizeofint*nnj
! call mpi_type_get_extent(type0,start,extend,code)
call MPI_TYPE_CREATE_RESIZED(TYPE0,start,extend,TYPE_SUBARRAY,info)
!type_subarray = type0
call MPI_TYPE_COMMIT(TYPE_SUBARRAY,code)
allocate(displ(ndi*ndj))
! forall(col=1:ndj,row=1:ndi)
! displ(1+(row-1)+(col-1)*ndi) = (row-1)+(col-1)*imax
! ! displ(1+(row-1)*ndi+(col-1)) = (row-1)*ndi+(col-1)
! endforall
! computing the displacement
do i =1,ndi
do j = 1,ndj
displ(1+(i-1)+(j-1)*ndj) = (i-1) + (j-1)*imax
enddo
enddo
allocate(local(i1:i2,j1:j2))
allocate(counts(n_procs))
counts = 1
local = rank
loc_size = (i2-i1+1)*(j2-j1+1)!(i2-i1+1)*(j2-j1+1)
call mpi_gatherv(local,loc_size,MPI_integer,&
global,counts,displ,type_subarray,0,&
MPI_COMM_WORLD,code)
call mpi_barrier(mpi_comm_world,code)
if(rank==0) then
do i = 1,nrows
print*,global(i,:)
enddo
endif
call mpi_finalize(code)
endprogram scatter

Does write statement in Fortran90 affect resulting variable?

I know that a write statement just prints a variable, but I have found something odd in my code.
For example, there are two variables called 'T_avg' and 'T_favg'.
I wanted to print T_favg, so I used the line below.
write(*,*) T_favg
result: 100
Then I also wanted to see T_avg, so I added that variable to the write statement.
write(*,*) T_avg, T_favg
result: 50, 200
I did not modify the code at all except for that line, but the result for T_favg is different.
I used this command line:
ifort.exe /O2 scale_analysis(07022020).f90 -o test.exe
What circumstance could be causing this?
Edit: Sorry, I have added my code below.
PROGRAM scale_analysis
USE, INTRINSIC :: iso_fortran_env, ONLY: real32, real64, FILE_STORAGE_SIZE
IMPLICIT NONE
! DECLARE PARAMETERS
INTEGER, PARAMETER :: SP = real32 ! SINGLE-PRECISION REAL
INTEGER, PARAMETER :: DP = real64 ! DOUBLE-PRECISION REAL
INTEGER, PARAMETER :: nx = 1024 ! TOTAL NUMBER OF GRID POINTS
INTEGER, PARAMETER :: iskip = 2
! DECLARE ALLOCATABLE ARRAYS
REAL(SP), ALLOCATABLE, DIMENSION(:,:,:) :: STMP, T, RHO
! DECLARE VARIABLES FOR SUMMATION
REAL(SP) :: rho_sum, rhoT_sum, T_sum
! DECLARE VARIABLES FOR REYNOLDS-AVERAGING
REAL(SP) :: rho_avg, rhoT_avg, T_avg
! DECLARE VARIABLES FOR FAVRE-AVERAGING
REAL(SP) :: T_favg
INTEGER :: i, j, k
CHARACTER(LEN=100) :: outfile, varname, infile4, infile5
CHARACTER(LEN=100) :: filename
!--------------------------------------------------------------------------
! DEMONSTRATE READ_DATA
!--------------------------------------------------------------------------
ALLOCATE(T(nx/iskip, nx/iskip, nx/iskip))
ALLOCATE(RHO(nx/iskip, nx/iskip, nx/iskip))
ALLOCATE(STMP(nx, nx, nx))
STMP = 0.0_SP
!--------------------------------------------------------------------------
! INITIALIZATION OF SUMMATION VARIABLES
!--------------------------------------------------------------------------
rho_sum = 0.0; rhoT_sum = 0.0;
T_sum = 0.0;
T_avg = 0.0; T_favg = 0.0;
!--------------------------------------------------------------------------
! DATA IMPORT -START
!--------------------------------------------------------------------------
varname = 'Z1_dil_inertHIT'
outfile = 'terms_in_Kolla(Z1_inertHIT).txt'
infile4 = 'E:\AUTOIGNITION\Z1\Temperature_inertHIT.bin'
infile5 = 'E:\AUTOIGNITION\Z1\Density_inertHIT.bin'
OPEN(44, file=TRIM(ADJUSTL(infile4)), status='old', access='stream', &
form='unformatted')
READ(44) stmp
CLOSE(44)
DO k = 1, nx/iskip
DO j = 1, nx/iskip
DO i = 1, nx/iskip
T(i,j,k) = stmp(1+iskip*(i-1), 1+iskip*(j-1), 1+iskip*(k-1))
END DO
END DO
END DO
WRITE(*, '(A)') 'Data4 is successfully read ... '
OPEN(55, file=TRIM(ADJUSTL(infile5)), status='old', access='stream', &
form='unformatted')
READ(55) stmp
CLOSE(55)
DO k = 1, nx/iskip
DO j = 1, nx/iskip
DO i = 1, nx/iskip
RHO(i,j,k) = stmp(1+iskip*(i-1), 1+iskip*(j-1), 1+iskip*(k-1))
END DO
END DO
END DO
WRITE(*, '(A)') 'Data5 is successfully read ... '
DEALLOCATE(stmp)
!--------------------------------------------------------------------------
! COMPUTE U_RMS
!--------------------------------------------------------------------------
DO k = 1, nx/iskip
DO j = 1, nx/iskip
DO i = 1, nx/iskip
rho_sum = rho_sum + RHO(i,j,k)
END DO
END DO
END DO
rho_avg = rho_sum / REAL((nx/iskip)**3)
!--------------------------------------------------------------------------
! COMPUTE TEMPERATRUE TAYLOR LENGTH SCALE(LAMBDA_T)
! C.TOWERY, DETONATION INITIATION BY COMPRESSIBLE TURBULENCE THERMODYNAMIC
! FLUCTUATIONS, CNF, 2019
!--------------------------------------------------------------------------
DO k = 1, nx/iskip
DO j = 1, nx/iskip
DO i = 1, nx/iskip
T_sum = T_sum + T(i,j,k)
rhoT_sum = rhoT_sum + ( RHO(i,j,k)*T(i,j,k) )
END DO
END DO
END DO
T_avg = T_sum / REAL((nx/iskip)**3)
rhoT_avg = rhoT_sum / REAL((nx/iskip)**3)
T_favg = rhoT_avg / rho_avg
DO k = 1, nx/iskip
DO j = 1, nx/iskip
DO i = 1, nx/iskip
rhoT_p_sum = rhoT_p_sum + RHO(i,j,k)*( (T(i,j,k) - T_favg)**2 )
END DO
END DO
END DO
rhoT_p_avg = rhoT_p_sum / REAL((nx/iskip)**3)
T_p = SQRT( rhoT_p_avg / rho_avg )
write(*,*) T_favg
write(*,*) T_avg,T_favg
END PROGRAM scale_analysis
The problematic part is the write statement: if I remove T_avg from the second write statement, the value printed for T_favg changes.

running mpi subroutine in fortran program

I want to run a Fortran program which calls a subroutine that I want to parallelize with MPI. I know this sounds complicated, but I want to be able to specify the number of processes for each call. What I would want to use is a structure like this:
program my_program
implicit none
!Define variables
nprocs = !formula for calculating number of processes.
call my_subroutine(output,nprocs,other input vars)
end my_program
I want to run my_subroutine with the same effect as this:
mpirun -n nprocs my_subroutine.o
where my_subroutine has been compiled with 'other input vars.'
Is this possible?
Here is a simple example. I try compiling as follows:
$ mpif90 -o my_program WAVE_2D_FP_TUNER_mpi.f90 randgen.f SIMPLE_ROUTINE.f90
I try to run it like this:
$ mpirun -np (1 or 2) my_program
PROGRAM WAVE_2D_FP_TUNER_mpi
USE MPI
IMPLICIT NONE
REAL(KIND=8) :: T,PARAM(1:3),Z,ZBQLU01
REAL(KIND=8) :: ERRORS,COSTS,CMAX,CMAX_V(1:1000),THRESHOLD,Z_MIN,Z_MAX
REAL(KIND=8) :: U,S,R(1:6),MATRIX(1:15)
INTEGER :: EN,INC,I,J,M,P
INTEGER :: NPROCS,IERR
!0.8,-0.4,0.4,10,4,4,7 -- [0.003,0.534]
!0.8,-0.2,0.2,10,4,4,7 -- [0.190,0.588]
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPROCS,IERR)
THRESHOLD = 0.D0
EN = 81
INC = 1
Z_MIN = -2.D-1; Z_MAX = 2.D-1
T = 1.D0
PARAM(1) = 10.D0; PARAM(2) = 4.D0; PARAM(3) = 4.D0
CMAX = 7.D0 !Max that wave speed could possibly be.
CALL ZBQLINI(0.D0)
OPEN(UNIT = 1, FILE = "TUNER_F.txt")
WRITE(1,*) 'Grid Size: '
WRITE(1,*) T/(EN-1)
DO P = 1,15
S = 0
Z = Z_MIN + (1.d0/(15-1))*dble((P-1))*(Z_MAX - Z_MIN)
WRITE(1,*) 'Z: ',Z
DO I = 1,1000
DO J = 1,6
R(J) = ZBQLU01(0.D0)
END DO
!CALL PDE_WAVE_F_mpi(T,PARAM,R,Z,CMAX,EN,INC,NPROCS,U)
CALL SIMPLE_ROUTINE(T,PARAM,R,Z,CMAX,EN,INC,NPROCS,U)
IF (U<=threshold) THEN
S = S + 1.D0
ELSE
S = S + 0.D0
END IF
END DO
MATRIX(P) = (1.D0/1000)*S
END DO
DO I = 1,15
WRITE(1,*) MATRIX(I)
END DO
PRINT *,MINVAL(MATRIX)
PRINT *,MAXVAL(MATRIX)
CLOSE(1)
CALL MPI_FINALIZE(IERR)
END PROGRAM WAVE_2D_FP_TUNER_mpi
Here is the subroutine that I wish to parallelize with MPI.
SUBROUTINE SIMPLE_ROUTINE(T,PARAM,R,Z,CMAX,EN,INC,NPROCS,U)
! Outputs scalar U = T*Z*CMAX*INC*SUM(PARAM)*SUM(R)*SUM(Y)
USE MPI
IMPLICIT NONE
REAL(KIND=8), INTENT(IN) :: T,PARAM(1:3),R(1:6),Z,CMAX
INTEGER, INTENT(IN) :: EN,INC
INTEGER, INTENT(IN) :: NPROCS
REAL(KIND=8), INTENT(OUT) :: U
REAL(KIND=8) :: H,LOCAL_SUM,SUM_OF_X
REAL(KIND=8), DIMENSION(:), ALLOCATABLE :: X
INTEGER :: PX,PX_MAX,NXL,REMX,IX_OFF,P_LEFT,P_RIGHT
INTEGER :: J
INTEGER :: IERR,MYID
! Broadcast nprocs handle to all processes in MPRI_COMM_WORLD
CALL MPI_BCAST(&NPROCS, NPROCS, MPI_INT, 0, MPI_COMM_WORLD,IERR)
! Create subcommunicator SUBCOMM (Do not know how to define WORLD_GROUP?)
CALL MPI_COMM_SPLIT(MPI_COMM_WORLD,WORLD_GROUP,SUBCOMM,IERR)
! Assign IDs to processes in SUBCOMM
CALL MPI_COMM_RANK(SUBCOMM,MYID,IERR)
! Give NPROCS - 1 to SUBCOMM
CALL MPI_COMM_SIZE(SUBCOMM,NPROCS-1,IERR)
H = 2.D0/(EN-1)
! LABEL THE PROCESSES FROM 1 TO PX_MAX.
PX = MYID + 1
PX_MAX = NPROCS
! SPLIT UP THE GRID IN THE X-DIRECTION.
NXL = EN/PX_MAX !nxl = 10/3 = 3
REMX = EN-NXL*PX_MAX !remx = 10-3*3 = 1
IF (PX .LE. REMX) THEN !for px = 1,nxl = 3
NXL = NXL+1 !nxl = 4
IX_OFF = (PX-1)*NXL !ix_off = 0
ELSE
IX_OFF = REMX*(NXL+1)+(PX-(REMX+1))*NXL !for px = 2 and px = 3, ix_off = 1*(3+1)+(2-(1+1))*3 = 4, ix_off = 1*(3+1)+(3-(1+1))*3 = 7
END IF
! ALLOCATE MEMORY FOR VARIOUS ARRAYS.
ALLOCATE(X(0:NXL+1))
X(:) = (/(-1.D0+DBLE(J-1+IX_OFF)*H, J=1,EN)/)
LOCAL_SUM = SUM(X(1:NXL))
CALL MPI_REDUCE(LOCAL_SUM,SUM_OF_X,1,&
MPI_DOUBLE_PRECISION,MPI_SUM,&
0,MPI_COMM_WORLD,IERR)
U = T*Z*CMAX*INC*SUM(PARAM)*SUM(R)*SUM_OF_X
DEALLOCATE(X)
CALL MPI_COMM_FREE(SUBCOMM,IERR)
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
END SUBROUTINE SIMPLE_ROUTINE
Ultimately, I want to be able to change the number of processors used in the subroutine, where I want nprocs to be calculated from the value of EN.
A simple approach is to start the MPI app with the maximum number of processes.
Then my_subroutine will first MPI_Bcast(&nprocs, ...) and MPI_Comm_split(MPI_COMM_WORLD, ..., &subcomm) in order to create a sub-communicator subcomm with nprocs tasks (you can use MPI_UNDEFINED so that the "other" communicator will be MPI_COMM_NULL).
Then the MPI tasks that are part of subcomm will perform the computation.
Finally, call MPI_Comm_free(&subcomm) and MPI_Barrier(MPI_COMM_WORLD).
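A minimal Fortran sketch of that split-based approach; the subroutine name run_on_subset and the argument nprocs_wanted are placeholders, not taken from the question:
! Sketch only: assumes nprocs_wanted holds the same value on every rank
! (e.g. computed from EN or broadcast from rank 0 as described above).
subroutine run_on_subset(nprocs_wanted)
  use mpi
  implicit none
  integer, intent(in) :: nprocs_wanted
  integer :: myrank, color, subcomm, ierr
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  ! Ranks below nprocs_wanted join the sub-communicator; the others pass
  ! MPI_UNDEFINED and get MPI_COMM_NULL back from MPI_COMM_SPLIT.
  if (myrank < nprocs_wanted) then
    color = 0
  else
    color = MPI_UNDEFINED
  end if
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, myrank, subcomm, ierr)
  if (subcomm /= MPI_COMM_NULL) then
    ! ... perform the computation using subcomm instead of MPI_COMM_WORLD ...
    call MPI_COMM_FREE(subcomm, ierr)
  end if
  ! Keep all ranks in step before returning to the caller.
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
end subroutine run_on_subset
The ranks that pass MPI_UNDEFINED simply skip the computation and meet the others at the final barrier.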
From a performance point of view, note that sub-communicator creation can be expensive, but hopefully it is not significant compared to the computation time. If it is, you would rather revamp your algorithm so that nprocs tasks do the job while the other ones wait.
Another approach would be to start your app with one MPI task, MPI_Comm_spawn() nprocs-1 tasks, merge the inter-communicator, perform the computation, and terminate the spawned tasks.
The overhead of task creation is much higher, and this might not be fully supported by your resource manager, so I would not advise this option.

Reading data from files using MPI in Fortran

I want to read data from some .dat files into a Fortran code for postprocessing. As a test case, I am using just one MPI process and trying to read a single data file into my code. The content of the data file is as follows:
qout0050.dat : 1 1 1
However, the matrix (Vn in this case) that is supposed to store the content of this data file contains all zeros. The relevant part of the code, which reads from the data file and stores the data in the matrix, is as follows:
subroutine postproc()
use precision_mod
use mpicomms_mod
implicit none
integer(kind=MPI_Offset_kind) :: i, j , igrid, k, l, disp, iproc, info, lwork
integer :: rst, numvar, ifile, number, num, step, ntot
integer :: Nx_max, Ny_max, Nz_max
integer :: Nxp, Nzp, Nyp, Ngrid
integer :: Ifirst, Ilast, Jfirst, Jlast, Kfirst, Klast
character*(64) :: fname, buffer, ffname
integer :: tmp, N1, N2, N3, tmpp
real(WP), allocatable :: qout(:,:,:,:), phi_xyz(:,:,:,:,:)
real(WP), allocatable :: Vn(:,:), Vntmp(:,:), TAU(:,:)
real(WP), allocatable :: Rprime(:,:), Rtmp(:,:), Rend(:,:), Q(:,:), tmpL(:), tmpG(:)
real(WP), allocatable :: s(:), vt(:,:), u(:,:), utmp(:,:)
real(WP), allocatable :: tmp1(:,:), tmp2(:,:), tmp3(:,:), phi(:,:)
integer, dimension(2) :: view
integer :: view1
integer, dimension(3) :: lsizes, gsizes, start
real(WP) :: tmpr
real(WP), allocatable :: work(:), mu(:), eigY(:,:), wr(:), wi(:), beta(:)
integer(kind=MPI_Offset_kind) :: SP_MOK, Nx_MOK, Ny_MOK, Nz_MOK, WP_MOK
open(unit=110,file='postparameters.dat',form="formatted")
read (110,*) Nx_max
read (110,*) Ny_max
read (110,*) Nz_max
read (110,*) numvar
read (110,*) step
close(110)
! Define the size of grid on each processor
if (mod(Nx_max,px).ne.0) then
write(*,*) 'Error in preproc: Nx_max is not devisable by px'
call MPI_ABORT(MPI_COMM_WORLD,0,ierr)
end if
Nxp = Nx_max/px
if (mod(Ny_max,py).ne.0) then
write(*,*) 'Error in preproc: Nx_max is not devisable by px'
call MPI_ABORT(MPI_COMM_WORLD,0,ierr)
end if
Nyp = Ny_max/py
if (mod(Nz_max,pz).ne.0) then
write(*,*) 'Error in preproc: Nx_max is not devisable by px'
call MPI_ABORT(MPI_COMM_WORLD,0,ierr)
end if
Nzp = Nz_max/pz
Ifirst = irank*Nxp + 1
Ilast = Ifirst + Nxp - 1
Jfirst = jrank*Nyp + 1
Jlast = Jfirst + Nyp - 1
Kfirst = krank*Nzp + 1
Klast = Kfirst + Nzp - 1
! Setting the view for phi
gsizes(1) = Nx_max
gsizes(2) = Ny_max
gsizes(3) = Nz_max
lsizes(1) = Nxp
lsizes(2) = Nyp
lsizes(3) = Nzp
start(1) = Ifirst - 1
start(2) = Jfirst - 1
start(3) = Kfirst - 1
call MPI_TYPE_CREATE_SUBARRAY(3,gsizes,lsizes,start,&
MPI_ORDER_FORTRAN,MPI_REAL_SP,view,ierr)
call MPI_TYPE_COMMIT(view,ierr)
call MPI_TYPE_CREATE_SUBARRAY(3,gsizes,lsizes,start,&
MPI_ORDER_FORTRAN,MPI_REAL_WP,view1,ierr)
call MPI_TYPE_COMMIT(view1,ierr)
WP_MOK = int(8, MPI_Offset_kind)
Nx_MOK = int(Nx_max, MPI_Offset_kind)
Ny_MOK = int(Ny_max, MPI_Offset_kind)
Nz_MOK = int(Nz_max, MPI_Offset_kind)
! Reading the qout file
ffname = 'qout'
allocate(qout(Nxp,Nyp,Nzp,numvar))
allocate(Vn(Nxp*Nyp*Nzp*numvar,step))
do rst = 1,step
if (myrank == 0) print*, 'Step = ', 50 + rst -1
write(buffer,"(i4.4)") 50 + rst -1
fname = trim(ffname)//trim(buffer)
fname = trim('ufs')//":"// trim(fname)
fname = trim(adjustl(fname))//'.dat'
call MPI_FILE_OPEN(MPI_COMM_WORLD,fname,MPI_MODE_RDONLY,MPI_INFO_NULL,ifile,ierr)
call MPI_FILE_READ(ifile,Ngrid,1,MPI_INTEGER,status,ierr)
if (1 /= Ngrid) then
if (myrank == 0 ) write(*,*) Ngrid
endif
call MPI_FILE_READ(ifile,tmp,1,MPI_INTEGER,status,ierr)
if (tmp /= Nx_max) write(*,*) tmp
call MPI_FILE_READ(ifile,tmp,1,MPI_INTEGER,status,ierr)
if (tmp /= Ny_max) write(*,*) tmp
call MPI_FILE_READ(ifile,tmp,1,MPI_INTEGER,status,ierr)
if (tmp /= Nz_max) write(*,*) tmp
call MPI_FILE_READ(ifile,tmpr,1,MPI_REAL_WP,status,ierr)
call MPI_FILE_READ(ifile,tmpr,1,MPI_REAL_WP,status,ierr)
call MPI_FILE_READ(ifile,tmpr,1,MPI_REAL_WP,status,ierr)
call MPI_FILE_READ(ifile,tmpr,1,MPI_REAL_WP,status,ierr)
do l=1,numvar
disp = 4*4 + 4*WP_MOK + Nx_MOK*Ny_MOK*Nz_MOK*WP_MOK*(l-1)
call MPI_FILE_SET_VIEW(ifile,disp,MPI_REAL_WP,view1,"native",MPI_INFO_NULL,ierr)
call MPI_FILE_READ_ALL(ifile,qout(1:Nxp,1:Nyp,1:Nzp,l),Nxp*Nzp*Nyp, MPI_REAL_WP,status,ierr)
end do
call MPI_FILE_CLOSE(ifile,ierr)
!-----------------------------------------------
! Bluiding the snapshot matrix Vn --------------
!-----------------------------------------------
do i=1,numvar
do k=1,Nzp
do j=1,Nyp
Vn((1 + Nxp*(j-1) + Nxp*Nyp*(k-1) + Nxp*Nyp*Nzp*(i-1)):(Nxp*j + Nxp*Nyp*(k-1) + Nxp*Nyp*Nzp*(i-1)),rst) = qout(1:Nxp,j,k,i)
end do
end do
end do
end do
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
deallocate(qout)

Fortran Error - attempt to call a routine with argument number three as a real(kind =1) when procedure was required

I have a Fortran project linked to various subroutines, which are called from the main program. Variables are passed using modules. I can compile the code without any error, but when I run it, during a subroutine call I get the error "attempt to call a routine with argument number three as a real(kind=1) when procedure was required". I am not sure where I am going wrong. Can someone point out the error? Your help is very much appreciated. The error appears when the subroutine 'ncalc' is called inside the loop.
program partbal
use const
use times
use density
use parameters
use rateconst
use ploss
implicit none
integer :: i
real :: nclp_init, ncl2p_init, ncln_init, ne_init
real :: ncl_init, ncl2_init, Te_init, neTe_init
open (10,file='in.dat')
read (10,*)
read (10,*) pressure
read (10,*)
read (10,*) P, pfreq, duty
read (10,*)
read (10,*) nclp_init, ncl2p_init, ncln_init, Te_init, Ti
pi = 3.14159265
R = 0.043
L = 0.1778
Al = 2*pi*R*R
Ar = 2*pi*R*L
V = pi*R*R*L
S = 0.066
e = 1.6e-19
me = 9.1e-31
mCl = 35.5/(6.023e26)
mCl2 = 2*mCl
k = 1.3806e-23
vi = (3*Ti*e/(53.25/6.023e26))**0.5
ncl2_init = pressure*0.1333/(1.3806e-23*298)/2
ncl_init = ncl2_init
ne_init = nclp_init + ncl2p_init - ncln_init
tot_time = 1/(pfreq*1000)
off_time = duty*tot_time
npoints = 10000
dt = tot_time/npoints
neTe_init = ne_init*Te_init
t_step = 0
call kcalc(Te_init)
call param(nclp_init, ncl2p_init, ncln_init, ne_init, ncl_init,ncl2_init, Te_init, Ti)
do i = 1, npoints, 1
t_step = i*dt + t_step
if (t_step > 0 .and. t_step <= 500) then
Pabs = 500
else if (t_step > 500) then
Pabs = 0
end if
if (i <= 1) then
call ncalc(ne_init, ncl_init, ncl2_init, nclp_init, ncln_init, ncl2p_init)
call powerloss(ne_init, ncl_init, ncl2_init, Pabs, neTe_init)
Te = neTe/ne
call kcalc(Te)
call param(nclp, ncl2p, ncln, ne, ncl, ncl2, Te, Ti)
else
call ncalc(ne, ncl, ncl2, nclp, ncln, ncl2p)
call powerloss(ne, ncl, ncl2, Pabs, neTe)
Te = neTe/ne
call kcalc(Te)
call param(nclp, ncl2p, ncln, ne, ncl, ncl2, Te, Ti)
end if
!open( 70, file = 'density.txt' )
!open( 80, file = 'Te.txt')
!do i = 1, 1001, 1
! np(i) = ncl2p(i) + nclp(i)
!write (70, *) ncl(i), ncl2(i), ncl2p(i), nclp(i), np(i), ncln(i), ne(i)
!close(70)
!write (80, *) Te(i), phi(i)
!close(80)
!end do
end do
end program partbal
subroutine ncalc(n_e, n_cl, n_cl2, n_clp, n_cln, n_cl2p)
use parameters
use const
use density
use rateconst
use times
implicit none
real :: n_e, n_cl, n_cl2, n_clp, n_cln, n_cl2p
nclp = (((kCliz*n_e*n_cl)+((kpair+kdisiz)*n_e*n_cl2)-(5e-14*n_clp*n_cln)-(S*n_clp/V)-((hlclp*Al+hrclp*Ar)*n_clp*ubclp))*dt)+n_clp
ncl2p = (((kCl2iz*n_e*n_cl2)-(5e-14*n_cl2p*n_cln) - (((hlcl2p*Al + hrcl2p*Ar)*n_cl2p*ubcl2p)/V)-(S*n_cl2p/V))*dt)+n_cl2p
ncln = ((((katt+kpair)*n_e*n_cl2)-(5e-14*n_clp*n_cln)-(5e-14*n_cl2p*n_cln)-(kdet*n_e*n_cln)-(S*n_cln/V)-(taun*(Al+Ar)/V))*dt)+n_cln
ne = ncl2p+nclp-ncln
ncl = ((((2*kdis+katt+kdisiz)*n_e*n_cl2)-(kCliz*n_e*n_cl)+(5e-14*n_cl2p *n_cln)+(2*5e-14*n_clp*n_cln)+ (kdet*n_e*n_cln) - (300*n_cl) + ((hlclp*Al + hrclp*Ar)*n_clp*ubclp/V)-(S*n_cl/V))*dt)+n_cl
ncl2 = ((n_cl2(1) + (5e-14*n_cl2p*n_cln) - ((kCl2iz+kdis+katt+kpair+kdisiz)*n_e*n_cl2) + (0.5*300*n_cl) + ((hlcl2p*Al + hrcl2p*Ar)*n_cl2p*ubcl2p/V)-(S*n_cl2/V))*dt)+n_cl2
return
end subroutine ncalc
In the line in subroutine ncalc immediately before the return statement, you have a reference to n_cl2(1) very early on the right-hand side of the assignment statement. n_cl2 has not been declared as an array, therefore the compiler assumes that it must be a reference to a function that takes a single default integer argument. Because n_cl2 is a dummy argument, it then expects you to provide a function for the corresponding actual argument when the routine is called.
(How your compiler manages to compile the preceding references to n_cl2 is a bit of a mystery - I suspect this error violates the syntax rules and hence you should see some sort of compile time diagnostic.)
Given you are using modules, it seems odd that you have not placed the ncalc routine in a module. If you did so, the error would probably become a compile-time error rather than a runtime one.
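A minimal sketch of that reshaping, assuming the (1) was simply a typo for the scalar dummy argument n_cl2; the other assignments and the project's own modules (parameters, const, density, rateconst, times) are kept as in the original:
module ncalc_mod
contains
  subroutine ncalc(n_e, n_cl, n_cl2, n_clp, n_cln, n_cl2p)
    use parameters
    use const
    use density
    use rateconst
    use times
    implicit none
    real :: n_e, n_cl, n_cl2, n_clp, n_cln, n_cl2p
    ! ... nclp, ncl2p, ncln, ne, ncl assignments unchanged from the original ...
    ! Corrected line: refer to the scalar dummy argument n_cl2 directly, not n_cl2(1).
    ncl2 = ((n_cl2 + (5e-14*n_cl2p*n_cln) &
           - ((kCl2iz+kdis+katt+kpair+kdisiz)*n_e*n_cl2) &
           + (0.5*300*n_cl) &
           + ((hlcl2p*Al + hrcl2p*Ar)*n_cl2p*ubcl2p/V) &
           - (S*n_cl2/V))*dt) + n_cl2
  end subroutine ncalc
end module ncalc_mod
With "use ncalc_mod" in program partbal, the interface of ncalc becomes explicit, and a mismatch like this is reported by the compiler instead of surfacing at run time.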