MPI_ALLREDUCE fails using many cores - fortran

I have a code that I have checked using one core, and it works perfectly; even under valgrind I get no possible memory leaks. However, when the code (the module where it fails is shown below):
module m_solv3Dahv_dir_adj_all
  implicit none
contains

  subroutine solv3Dahv_dir_adj_all( n_sg, shgat, buoy, cmpr, freq_0_inv, freq_inv, error_sum, grad_sum, illu_sum )
    use m_geo, only: nx, ny, nz
    implicit none
#ifdef mpi
    include 'mpif.h'
#endif
    integer, allocatable, intent(in) :: n_sg(:)
    type(shot_gather), allocatable, intent(inout) :: shgat(:)
    real(rl), intent(in) :: freq_0_inv, freq_inv
    real(rl), intent(out) :: error_sum
    real(rl), allocatable, intent(inout) :: grad_sum(:,:,:), illu_sum(:,:,:)
    real(rl), allocatable, intent(in) :: cmpr(:,:,:), buoy(:,:,:)
    integer :: ix, iy, iz
    real(rl), allocatable :: v_aux(:)
    ...
    allocate( v_aux(nx), stat=stal ); if ( stal/=0 ) stop 'AE solv3Dahv_dir_adj_all 1'
    ...
#ifdef mpi
    ! Sum gradient for each process.
    do iy=1,ny
       do iz=1,nz
          v_aux = grad_sum(:,iy,iz)
          call MPI_ALLREDUCE( v_aux, grad_sum(:,iy,iz), nx, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr )
       end do
    end do
#endif
    ...
...
uses the MPI part to sum the values of the allocatable grad_sum, it breaks with segmentation faults. I couldn't explain it, so I tested the code with valgrind, and surprisingly the code runs to the end (I only reduced the number of iterations, since otherwise it would never finish), but valgrind complains many times about MPI_ALLREDUCE. I have copied below the first valgrind complaint (there are hundreds, but they are all similar). I have no idea what the problem is; I guess it is something related to MPI, because the code runs when I use a few cores (three), but with 20 cores it breaks. The dimensions of grad_sum for the tests are small (nx=413, ny=351 and nz=1), so it cannot be a problem with RAM. Is there any problem with issuing many MPI_ALLREDUCE calls inside a loop with many cores?
==32173== Use of uninitialised value of size 8
==32173== at 0x5421330: btl_openib_handle_incoming (btl_openib_component.c:3092)
==32173== by 0x5422A0A: handle_wc (btl_openib_component.c:3401)
==32173== by 0x5422A0A: poll_device (btl_openib_component.c:3541)
==32173== by 0x5422A0A: progress_one_device (btl_openib_component.c:3649)
==32173== by 0x5422A0A: btl_openib_component_progress (btl_openib_component.c:3674)
==32173== by 0x557ADF9: opal_progress (opal_progress.c:207)
==32173== by 0x53D53B4: opal_condition_wait (condition.h:99)
==32173== by 0x53D53B4: ompi_request_default_wait_all (req_wait.c:263)
==32173== by 0x543933E: ompi_coll_tuned_allreduce_intra_recursivedoubling (coll_tuned_allreduce.c:223)
==32173== by 0x53E18FB: PMPI_Allreduce (pallreduce.c:105)
==32173== by 0x5137D42: PMPI_ALLREDUCE (pallreduce_f.c:77)
==32173== by 0x4F0CAF: __m_solv3dahv_dir_adj_all_MOD_solv3dahv_dir_adj_all (in /data/PROCESSED_DATA/dagnino/01-code/Nicaragua/inv_str2/fwi)
==32173== by 0x504D03: __m_fwi_MOD_fwi (in /data/PROCESSED_DATA/dagnino/01-code/Nicaragua/inv_str2/fwi)
==32173== by 0x50C44E: MAIN__ (in /data/PROCESSED_DATA/dagnino/01-code/Nicaragua/inv_str2/fwi)
==32173== by 0x549EE9: main (in /data/PROCESSED_DATA/dagnino/01-code/Nicaragua/inv_str2/fwi)
==32173== Uninitialised value was created by a heap allocation
==32173== at 0x4A05DBC: memalign (vg_replace_malloc.c:858)
==32173== by 0x4A05E57: posix_memalign (vg_replace_malloc.c:1021)
==32173== by 0x54837B8: mca_mpool_rdma_alloc (mpool_rdma_module.c:103)
==32173== by 0x53C0E9F: ompi_free_list_grow (ompi_free_list.c:203)
==32173== by 0x541D088: prepare_device_for_use (btl_openib_component.c:1217)
==32173== by 0x542096E: btl_openib_component_init (btl_openib_component.c:2902)
==32173== by 0x5405E53: mca_btl_base_select (btl_base_select.c:111)
==32173== by 0x5405191: mca_bml_r2_component_init (bml_r2_component.c:85)
==32173== by 0x54039C8: mca_bml_base_init (bml_base_init.c:71)
==32173== by 0x54A3CDF: mca_pml_ob1_component_init (pml_ob1_component.c:181)
==32173== by 0x549CB73: mca_pml_base_select (pml_base_select.c:132)
==32173== by 0x53D6219: ompi_mpi_init (ompi_mpi_init.c:522)
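(Aside: the per-column reductions in the loop above can, in principle, be collapsed into a single call over the whole array. The fragment below is an illustrative sketch only, not the poster's code; it assumes grad_sum is contiguous, which holds for an allocatable array, and that real(rl) matches MPI_DOUBLE_PRECISION, as the original call already implies. MPI_IN_PLACE removes the need for a separate send buffer such as v_aux.)

#ifdef mpi
    ! Sum the gradient across all processes in one reduction.
    call MPI_ALLREDUCE( MPI_IN_PLACE, grad_sum, nx*ny*nz, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr )
#endif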

Related

Fortran manage local array memory

This is more of a best-practice question about writing Fortran code than about fixing an error.
I have the following code sample with a large array that needs to be passed to a subroutine for some calculation:
program name
  implicit none
  integer, parameter :: n = 10**8
  complex(kind=8) :: x(n)
  integer :: i, nVal
  nVal = 30
  do i = 1, 1000
    call test(x, nVal)
    !----- other calculations -----!
    ! After every step nVal changes, and after a few steps nVal converges.
    ! E.g. `nVal` starts from 30 and converges to 14 after 10-15 steps, and stays there for the rest of the loop.
    ! Once `nVal` converges, `workarray` requires much less memory than it does at the start.
  enddo
contains
  subroutine test(arr, m)
    integer, intent(inout) :: m
    complex(kind=8), intent(inout) :: arr(n)
    complex(kind=8) :: workarray(n,m) ! <-- large workspace
    !----- do calculation -----------!
    !--- check convergence of `m` ----!
  end
end program name
The internal workarray depends on a value that decreases gradually, reaches convergence, and stays there for the rest of the code. If I check the memory usage with top, it stays at 27% from start to finish. But after a few steps the memory requirement should decrease too.
So, I modified the code to use an allocatable workarray like this,
program name
  implicit none
  integer, parameter :: n = 10**8
  complex(kind=8) :: x(n)
  integer :: i, nVal, oldVal
  complex(kind=8), allocatable :: workarray(:,:)
  nVal = 30
  oldVal = nVal
  allocate(workarray(n,nVal))
  do i = 1, 1000
    ! all calculations of the subroutine `test` brought into this main code
    !--- check convergence of `nVal` ----!
    if (nVal /= oldVal) then
      deallocate(workarray)
      allocate(workarray(n,nVal))
      oldVal = nVal
    endif
  enddo
end program name
Now, if I use top, the memory usage starts at about 28%, then decreases and reaches a converged value of 19%.
Now, my question is how I should code situations like this. The allocatable option does decrease the memory requirement, but it also hampers the code readability a little and introduces code duplication in several places. On the other hand, the prior option holds on to more memory for the whole run when much less would suffice. So, what is the preferred way of coding in this situation?
I can't help you decide which of the two methods is better; it will depend on how you (or the users of your code) value the potential tradeoff between memory use and cpu use. However, I can suggest a better version of your second method.
Rather than passing workarray in and out of test, you can keep it local to test and use the save attribute to make it persistent between procedure calls.
This would look something like
program name
  implicit none
  integer, parameter :: dp = selected_real_kind(15,300)
  integer, parameter :: n = 10**8
  complex(dp) :: x(n)
  integer :: i, nVal
  nVal = 30
  do i = 1, 1000
    call test(x, nVal)
  enddo
contains
  subroutine test(arr, m)
    complex(dp), intent(inout) :: arr(:)
    integer, intent(inout) :: m
    ! Keep workarray local and persistent between calls.
    ! (An allocatable variable cannot be initialised in its declaration,
    !  so check its allocation status instead.)
    complex(dp), allocatable, save :: workarray(:,:)
    ! (Re)allocate workarray if necessary.
    if (.not. allocated(workarray)) allocate(workarray(size(arr), m))
    if (size(workarray, 2) < m) then
      deallocate(workarray)
      allocate(workarray(size(arr), m))
    endif
  end subroutine
end program
If m is likely to increase slowly, you may also want to consider replacing allocate(workarray(size(arr), m)) with allocate(workarray(size(arr), 2*m)), such that you get C++ std::vector-style memory management.
The main downside of this approach (besides not reducing the memory use) is that you need to be more careful if you want to run parallel code which uses procedures with saved variables.
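For concreteness, a minimal sketch of that geometric-growth variant (same names as in the code above; only the reallocation branch changes):

    ! Grow geometrically so reallocations become rare as m creeps upwards.
    if (size(workarray, 2) < m) then
      deallocate(workarray)
      allocate(workarray(size(arr), 2*m))
    endif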

Possible bug in ifort

While trying to use a type for blockdiagonal matrices in Fortran code,
I stumbled over a surprising bug in the following piece of code using the following compilers:
GNU Fortran (SUSE Linux) 7.4.0
ifort (IFORT) 18.0.5 20180823
ifort (IFORT) 16.0.1 20151021
If I compile
gfortran -Wall -Werror --debug ifort_bug.f && valgrind ./a.out
I have no errors reported from valgrind.
If I compile
ifort -warn all,error -debug -stacktrace ifort_bug.f && valgrind ./a.out
I get a segmentation fault for ifort_18 in my code and "only" memory leaks for ifort_16.
Is this a bug in the intel compilers, or does gfortran silently fix my bad code?
      module blockdiagonal_matrices
      implicit none
      private
      public :: t_blockdiagonal, new, delete,
     &          blocksizes, operator(.mult.), mult
      save
      integer, parameter :: dp = kind(1.d0)

      type :: t_blockdiagonal
         real(dp), allocatable :: block(:, :)
      end type

      interface new
         module procedure block_new
      end interface

      interface delete
         module procedure block_delete
      end interface

      interface operator (.mult.)
         module procedure mult_blocks
      end interface

      contains

      subroutine block_new(blocks, blocksizes)
         type(t_blockdiagonal), intent(out) :: blocks(:)
         integer, intent(in) :: blocksizes(:)
         integer :: i, L
         do i = 1, size(blocks)
            L = blocksizes(i)
            allocate(blocks(i)%block(L, L))
         end do
      end subroutine

      subroutine block_delete(blocks)
         type(t_blockdiagonal) :: blocks(:)
         integer :: i
         do i = 1, size(blocks)
            deallocate(blocks(i)%block)
         end do
      end subroutine

      function blocksizes(A) result(res)
         type(t_blockdiagonal), intent(in) :: A(:)
         integer :: res(size(A))
         integer :: i
         res = [(size(A(i)%block, 1), i = 1, size(A))]
      end function

      function mult_blocks(A, B) result(C)
         type(t_blockdiagonal), intent(in) :: A(:), B(:)
         type(t_blockdiagonal) :: C(size(A))
         integer :: i
         call new(C, blocksizes=blocksizes(A))
         do i = 1, size(A)
            C(i)%block = matmul(A(i)%block, B(i)%block)
         end do
      end function

      subroutine mult(A, B, C)
         type(t_blockdiagonal), intent(in) :: A(:), B(:)
         type(t_blockdiagonal) :: C(:)
         integer :: i
         do i = 1, size(A)
            C(i)%block = matmul(A(i)%block, B(i)%block)
         end do
      end subroutine
      end module blockdiagonal_matrices
      program time_blockdiagonal
      use blockdiagonal_matrices
      integer, parameter :: n_blocks = 2, L_block = 10**2
      type(t_blockdiagonal) :: A(n_blocks), B(n_blocks), C(n_blocks)
      integer :: i
      integer :: start, finish, rate

      call system_clock(count_rate=rate)
      call new(A, blocksizes=[(L_block, i = 1, n_blocks)])
      call new(B, blocksizes=[(L_block, i = 1, n_blocks)])
      call new(C, blocksizes=[(L_block, i = 1, n_blocks)])
      do i = 1, n_blocks
         call random_number(A(i)%block)
         call random_number(B(i)%block)
      end do

      call system_clock(start)
      C = A .mult. B
      call system_clock(finish)
      write(6,*) 'Elapsed Time in seconds:',
     &           dble(finish - start) / dble(rate)

      call system_clock(start)
      call mult(A, B, C)
      call system_clock(finish)
      write(6,*) 'Elapsed Time in seconds:',
     &           dble(finish - start) / dble(rate)

      call delete(A)
      call delete(B)
      call delete(C)
      end program time_blockdiagonal
The ifort_18 segmentation fault is:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
a.out 00000000004134BD Unknown Unknown Unknown
libpthread-2.26.s 00007FF720EB6300 Unknown Unknown Unknown
a.out 000000000040ABB8 Unknown Unknown Unknown
a.out 000000000040B029 Unknown Unknown Unknown
a.out 0000000000407113 Unknown Unknown Unknown
a.out 0000000000402B4E Unknown Unknown Unknown
libc-2.26.so 00007FF720B0AF8A __libc_start_main Unknown Unknown
a.out 0000000000402A6A Unknown Unknown Unknown
When I went into it with gdb, it turned out that the segfault is raised upon returning from the function mult_blocks.
blockdiagonal_matrices::mult_blocks (c=..., a=..., b=...) at ifort_bug.f:63
63 do i = 1, size(A)
(gdb) s
64 C(i)%block = matmul(A(i)%block, B(i)%block)
(gdb) s
s
s
65 end do
(gdb) s
64 C(i)%block = matmul(A(i)%block, B(i)%block)
(gdb) s
65 end do
(gdb) s
66 end function
(gdb) s
Program received signal SIGSEGV, Segmentation fault.
0x000000000040abc4 in do_deallocate_all ()
(gdb) q
A debugging session is active.
Even with this information I cannot find a bug in my code.
EDIT
I found a fix, although I don't understand why it works.
Compilation fix
If I use -heap-arrays for compilation, it works, so at first glance it seems to run into a stack overflow. However, ulimit -s unlimited does not solve the problem.
Code fix
If I explicitly allocate in the code it solves the issue.
      subroutine new(blocks, blocksizes)
         type(t_blockdiagonal), allocatable, intent(out) :: blocks(:)
         integer, intent(in) :: blocksizes(:)
         integer :: i, L
         allocate(blocks(size(blocksizes)))
         do i = 1, size(blocks)
            L = blocksizes(i)
            allocate(blocks(i)%block(L, L))
         end do
      end subroutine

      subroutine delete(blocks)
         type(t_blockdiagonal), allocatable :: blocks(:)
         integer :: i
         do i = 1, size(blocks)
            deallocate(blocks(i)%block)
         end do
         deallocate(blocks)
      end subroutine

      function mult_blocks(A, B) result(C)
         type(t_blockdiagonal), intent(in) :: A(:), B(:)
         type(t_blockdiagonal), allocatable :: C(:)
         integer :: i
         call new(C, blocksizes=blocksizes(A))
         do i = 1, size(A)
            C(i)%block = matmul(A(i)%block, B(i)%block)
         end do
      end function
This way of writing is IMHO actually better than before, and not a "dirty hack".
It is no longer possible to call new with differing sizes of blocksizes and blocks.
Open question
Speaking in C terms, type(t_blockdiagonal) :: blocks(n) should be a float** blocks[n], so just a vector of pointers to pointers.
The allocation of the actual blocks happened on the heap in the first version as well. Hence I don't understand the stack overflow for a vector that contains ca. 10 pointers.
Intel® Fortran Compiler - Increased stack usage of 8.0 or higher compilers causes segmentation fault (via archive.org)
According to this article, ulimit -s unlimited does not mean that the stack will be literally "unlimited":
The size of "unlimited" varies by Linux configuration, so you may need to specify a larger, specific number to ulimit (for example, 999999999). On Linux also note that many 32bit Linux distributions ship with a pthread static library (libpthread.a) that at runtime will fix the stacksize to 2093056 bytes regardless of the ulimit setting.
It is most likely that you do indeed have a stack overflow, given that -heap-arrays solves the issue.
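To make the stack-overflow explanation concrete: with ifort, automatic arrays and arrays created for temporary computations are placed on the stack by default, and -heap-arrays moves them to the heap (which is presumably why it helps with the C = A .mult. B assignment above). The fragment below is purely illustrative of that kind of object; it is not taken from the question.

subroutine stack_hungry(n)
  ! An automatic array: its storage is typically taken from the stack
  ! unless the compiler is told otherwise (e.g. ifort -heap-arrays).
  integer, intent(in) :: n
  real(kind(1.d0)) :: tmp(n, n)
  tmp = 0.d0
end subroutine stack_hungry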

MPI_WIN_ALLOCATE_SHARED: memory limited?

It seems like whenever I try to allocate a window of around 30-32 MB, I get a segmentation fault.
I am using the routine MPI_WIN_ALLOCATE_SHARED.
Does anybody know if there is a limit to how big my window can be? If so, is there a way to compile my code relaxing that limit?
I am using Intel MPI 19.0.3 and ifort 19.0.3.
The example is written in Fortran. By varying the integer size_ you can see when the segmentation fault occurs. I tested it with size_=10e3 and size_=10e4; the latter caused a segmentation fault.
!------
program TEST_STACK
  use, intrinsic :: ISO_C_BINDING
  implicit none
  include 'mpif.h'

  !--- Parameters (They should not be changed ! )
  integer, parameter :: whoisroot = 0    ! - Root always 0 here

  !--- General parallel
  integer :: whoami                      ! - My rank
  integer :: mpi_nproc                   ! - no. of procs
  integer :: mpierr                      ! - Error status
  integer :: status(MPI_STATUS_SIZE)     ! - For MPI_RECV

  !--- Shared memory stuff
  integer :: whoami_shm                  ! - Local rank in shared memory group
  integer :: mpi_shm_nproc               ! - No. of procs in Shared memory group
  integer :: no_partners                 ! - No. of partners for share memory
  integer :: info_alloc

  !--- MPI groups
  integer :: world_group                 ! - All procs across all nodes
  integer :: shared_group                ! - Only procs that share memory
  integer :: MPI_COMM_SHM                ! - Shared memory communicators (for those in shared_group)

  type(C_PTR) :: ptr_buf
  integer(kind = MPI_ADDRESS_KIND) :: size_bytes, lb
  integer :: win, size_, disp_unit

  call MPI_INIT ( mpierr )
  call MPI_COMM_RANK ( MPI_COMM_WORLD, whoami, mpierr )
  call MPI_COMM_RANK ( MPI_COMM_WORLD, whoami, mpierr )
  call MPI_COMM_SIZE ( MPI_COMM_WORLD, mpi_nproc, mpierr)

  call MPI_COMM_SPLIT_TYPE( MPI_COMM_WORLD,       &
                            MPI_COMM_TYPE_SHARED, &
                            0,                    &
                            MPI_INFO_NULL,        &
                            MPI_COMM_SHM,         &
                            mpierr )
  call MPI_COMM_RANK( MPI_COMM_SHM, whoami_shm, mpierr )
  call MPI_COMM_SIZE( MPI_COMM_SHM, mpi_shm_nproc, mpierr )

  size_ = 10e4                           ! - seg fault
  size_bytes = size_ * MPI_REAL
  disp_unit = MPI_REAL
  size_bytes = size_*disp_unit

  call MPI_INFO_CREATE( info_alloc, mpierr )
  call MPI_INFO_SET( info_alloc,               &
                     "alloc_shared_noncontig", &
                     "true",                   &
                     mpierr )
  !
  call MPI_WIN_ALLOCATE_SHARED( size_bytes,   &
                                disp_unit,    &
                                info_alloc,   &
                                MPI_COMM_SHM, &
                                ptr_buf,      &
                                win,          &
                                mpierr )
  call MPI_WIN_FREE(win, mpierr)
end program TEST_STACK
I run my code using the following command:
mpif90 test_stack.f90; mpirun -np 2 ./a.out
This wrapper is linked to my ifort 19.0.3 and Intel MPI library. This has been verified by running
mpif90 -v
and, to be precise, my mpif90 is a symbolic link to my mpiifort wrapper. This is for personal convenience, but it shouldn't be causing problems, I guess?
The manual says that the call to MPI_WIN_ALLOCATE_SHARED looks like this
USE MPI
MPI_WIN_ALLOCATE_SHARED(SIZE, DISP_UNIT, INFO, COMM, BASEPTR, WIN, IERROR)
INTEGER(KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR
INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR
At least the types of disp_unit and baseptr do not match in your program.
I was finally able to diagnose where the error stems from.
In the code I have
disp_unit = MPI_REAL
size_bytes = size_*disp_unit
MPI_REAL is a constant/parameter defined by MPI and is not equal to 4, as I had very wrongly expected (4 for 4 bytes, single precision). In my MPI version it is set to 1275069468, which is most likely a handle id rather than any sensible byte count.
Hence, multiplying this number by the size of my array very quickly exceeds the available memory, and it can also overflow the range representable by a default 4-byte integer in the intermediate product.
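A minimal corrected sketch of the size/type handling, reusing size_ and MPI_COMM_SHM from the program above and following the USE MPI binding quoted in the previous answer (BASEPTR as an address-sized integer); the element size is computed from the type itself (dummy is an illustrative helper variable) instead of from the MPI_REAL handle:

    integer(kind=MPI_ADDRESS_KIND) :: size_bytes, baseptr
    integer :: disp_unit, win, mpierr
    real :: dummy

    disp_unit  = storage_size(dummy) / 8                   ! bytes per default REAL (normally 4)
    size_bytes = int(size_, MPI_ADDRESS_KIND) * disp_unit
    call MPI_WIN_ALLOCATE_SHARED( size_bytes, disp_unit, MPI_INFO_NULL, &
                                  MPI_COMM_SHM, baseptr, win, mpierr )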

Loop in fortran with openmp and allocatable arrays

I would like to do this:
program main
  implicit none
  integer l
  integer, allocatable, dimension(:) :: array

  allocate(array(10))
  array = 0

  !$omp parallel do private(array)
  do l = 1, 10
    array(l) = l
  enddo
  !$omp end parallel do

  print *, array
  deallocate(array)
end
But I am running into error messages:
*** glibc detected *** ./a.out: munmap_chunk(): invalid pointer: 0x00007fff25d05a40 ***
This seems to be a bug in ifort according to some discussions on the Intel forums, but it should be resolved in the version I am using (11.1.073, Linux). This is a MASSIVELY downscaled version of my code! I unfortunately cannot use static arrays as a workaround.
If I put the print into the loop, I get other errors:
*** glibc detected *** ./a.out: double free or corruption (out): 0x00002b22a0c016f0 ***
I didn't get the errors you're getting, but you have an issue with privatizing array in your OpenMP call.
[mjswartz#666-lgn testfiles]$ vi array.f90
[mjswartz#666-lgn testfiles]$ ifort -o array array.f90 -openmp
[mjswartz#666-lgn testfiles]$ ./array
0 0 0 0 0 0
0 0 0 0
[mjswartz#666-lgn testfiles]$ vi array.f90
[mjswartz#666-lgn testfiles]$ ifort -o array array.f90 -openmp
[mjswartz#666-lgn testfiles]$ ./array
1 2 3 4 5 6
7 8 9 10
First run is with private array, second is without.
program main
  implicit none
  integer l
  integer, allocatable, dimension(:) :: array

  allocate(array(10))

  !$omp parallel do
  do l = 1, 10
    array(l) = l
  enddo

  print*, array
  deallocate(array)
end program main
I just ran your code with ifort and OpenMP, and it spewed 0d0's. I had to manually quit the execution. What is your expected output? I'm not a big fan of unnecessarily dynamically allocating arrays. You know what you're going to allocate your matrices as, so just make parameters and do it statically. I'll mess with some stuff and edit this response in a few.
OK, so here are my edits:
program main
  implicit none
  integer :: l, j
  integer, parameter :: lmax = 15e3
  integer, parameter :: jmax = 25
  integer, parameter :: nk = 300
  complex*16, dimension(9*nk) :: x0, xin, xout
  complex*16, dimension(lmax) :: e_pump, e_probe
  complex*16 :: e_pumphlp, e_probehlp
  character*25 :: problemtype
  real*8 :: m
  ! OpenMP variables
  integer :: myid, nthreads, omp_get_num_threads, omp_get_thread_num

  x0 = 0.0d0
  problemtype = 'type1'
  if (problemtype .ne. 'type1') then
    write(*,*) 'Problem type not specified. Quitting'
    stop
  else
    ! Spawn a parallel region explicitly scoping all variables
    !$omp parallel
    myid = omp_get_thread_num()
    if (myid .eq. 0) then
      nthreads = omp_get_num_threads()
      write(*,*) 'Starting program with', nthreads, 'threads'
    endif

    !$omp do private(j,l,m,e_pumphlp,e_probehlp,e_pump,e_probe)
    do j = 1, jmax - 1
      do l = 1, lmax
        call electricfield(0.0d0, 0.0d0, e_pumphlp, &
                           e_probehlp, 0.0d0)
        ! print *, e_pumphlp, e_probehlp
        e_pump(l) = e_pumphlp
        e_probe(l) = e_probehlp
        print *, e_pump(l), e_probe(l)
      end do
    end do
    !$omp end parallel
  end if
end program main
Notice I removed your use of a module since it was unnecessary. You have an external module containing a subroutine, so just make it an external subroutine. Also, I changed your matrices to be statically allocated. Case statements are a fancy and expensive version of if statements. You were casing 15e3*25 times rather than once (expensive), so I moved those outside. I changed the OpenMP calls, but only semantically. I gave you some output so that you know what OpenMP is actually doing.
Here is the new subroutine:
subroutine electricfield(t, tdelay, e_pump, e_probe, phase)
  implicit none
  real*8, intent(in) :: t, tdelay
  complex*16, intent(out) :: e_pump, e_probe
  real*8, optional, intent(in) :: phase

  e_pump = 0.0d0
  e_probe = 0.0d0
  return
end subroutine electricfield
I just removed the module shell around it and changed some of your variable names. Fortran is not case sensitive, so don't torture yourself by doing caps and having to repeat it throughout.
I compiled this with
ifort -o diffeq diffeq.f90 electricfield.f90 -openmp
and ran with
./diffeq > output
to catch the program vomiting 0's and to see how many threads I was using:
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
Starting program with 32 threads
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
Hope this helps!
It would appear that you are running into a compiler bug associated with the implementation of OpenMP 3.0.
If you can't update your compiler, then you will need to change your approach. There are a few options: for example, you could make the allocatable arrays shared, increase their rank by one, and have one thread allocate them such that the extent of the additional dimension is the number of workers in the team. All subsequent references to those arrays then need to have the subscript for that additional rank be the OpenMP thread number (+ 1, depending on what you've used for the lower bound).
Explicit allocation of the private allocatable arrays inside the parallel construct (only) may also be an option.
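A minimal sketch of the first option (a shared array with one extra dimension indexed by the thread number); the program and variable names here are illustrative, and the loop body is only a stand-in:

program shared_per_thread
  use omp_lib
  implicit none
  integer :: l, tid
  integer, allocatable :: array(:,:)         ! shared; last dimension = one slice per thread

  !$omp parallel private(tid, l)
  tid = omp_get_thread_num() + 1             ! +1 because the lower bound below is 1
  !$omp single
  allocate(array(10, omp_get_num_threads())) ! one thread allocates for the whole team
  array = 0
  !$omp end single                           ! implicit barrier: allocation done before use
  !$omp do
  do l = 1, 10
    array(l, tid) = l                        ! each thread writes only its own slice
  enddo
  !$omp end do
  !$omp end parallel

  print *, array(:, 1)
  deallocate(array)
end program shared_per_thread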

How to call subroutines in fortran parallelized by MPI?

My problem is that I don't know how to call subroutines when I use MPI in Fortran.
I have written this small code named TRY.f90, which calls a subroutine named CONCENTRATION.f90. How should I change CONCENTRATION.f90 in order to make the code work?
PROGRAM TRY
  USE MPI
  integer status(mpi_status_size)
  INTEGER I, J, K, II, IERR, MY_ID, NUM_PROCS, PSP
  INTEGER, PARAMETER :: GRIDX = 64, GRIDY = 64
  REAL, DIMENSION(gridx,gridy) :: PSI

  PSI = 0
  PRINT*, 'VARIABLE'

  CALL MPI_INIT(IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_ID, IERR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NUM_PROCS, IERR)

  CALL CONCENTRATION(GRIDX, GRIDY, NUM_PROCS, MY_ID, PSI)

  IF (MY_ID .NE. 0) THEN
    CALL mpi_send( PSI(1+MY_ID*GRIDX/NUM_PROCS:(MY_ID+1)*GRIDX/NUM_PROCS:1,1:GRIDY:1),&
                   (GRIDX/NUM_PROCS)*GRIDY, mpi_real, 0, 10, mpi_comm_world, ierr)
  END IF

  IF (MY_ID .EQ. 0) THEN
    DO II = 1, NUM_PROCS-1
      CALL mpi_recv( PSI(1+II*GRIDX/NUM_PROCS:(II+1)*GRIDX/NUM_PROCS:1,1:GRIDY:1),&
                     (GRIDX/NUM_PROCS)*GRIDY, mpi_real, &
                     II, 10, mpi_comm_world, status, ierr)
    END DO
  END IF

  CALL MPI_FINALIZE(IERR)
END PROGRAM TRY
I am using a subroutine named CONCENTRATION.f90 which is:
SUBROUTINE CONCENTRATION(GRIDX, GRIDY, NUM_PROCS, MY_ID, PSI)
  implicit none
  INTEGER*8, INTENT(IN) :: GRIDX, GRIDY
  INTEGER, INTENT(IN) :: NUM_PROCS, MY_ID
  REAL*8, DIMENSION(GRIDX,GRIDY), INTENT(OUT) :: PSI
  INTEGER*8 I, J

  DO I = 1+MY_ID*GRIDX/NUM_PROCS, (MY_ID+1)*GRIDX/NUM_PROCS
    DO J = 1, GRIDY
      PSI(I,J) = 2.0
    END DO
  END DO
END SUBROUTINE CONCENTRATION
The code currently gives me errors, since I think I need to make some changes to the subroutine CONCENTRATION.f90, or perhaps to the way I call it. Could you please tell me what those changes are? Thanks for your help in advance.
Your program segfaults because of a type mismatch. In the main program you have declared PSI as an array of REAL:
REAL , DIMENSION(gridx,gridy) :: PSI
while in the CONCENTRATION subroutine you use another type, REAL*8:
REAL*8 , DIMENSION(GRIDX,GRIDY), INTENT(OUT) :: PSI
By default REAL is 4 bytes long while REAL*8 (or DOUBLE PRECISION or REAL(KIND=8)) is 8 bytes long. So you are giving CONCENTRATION an array that is half the size it believes it to be, and all ranks from NUM_PROCS/2 onwards write past the end of the PSI array and thus cause segfaults. If you run with one process only, then even rank 0 will segfault.
You should also read about MPI collective operations. MPI_GATHER and MPI_GATHERV do exactly what you are trying to achieve with multiple sends and receives here.
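For illustration, a hedged sketch of one consistent set of declarations: this version of the subroutine keeps default REAL for PSI (so the existing mpi_real transfers in TRY still match) and uses default INTEGER for GRIDX and GRIDY, which are passed as default-kind integers from the main program.

SUBROUTINE CONCENTRATION(GRIDX, GRIDY, NUM_PROCS, MY_ID, PSI)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: GRIDX, GRIDY                 ! same default kind as in PROGRAM TRY
  INTEGER, INTENT(IN) :: NUM_PROCS, MY_ID
  REAL, DIMENSION(GRIDX,GRIDY), INTENT(OUT) :: PSI    ! default REAL, matching the caller
  INTEGER :: I, J
  DO I = 1 + MY_ID*GRIDX/NUM_PROCS, (MY_ID+1)*GRIDX/NUM_PROCS
    DO J = 1, GRIDY
      PSI(I,J) = 2.0
    END DO
  END DO
END SUBROUTINE CONCENTRATION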
The only change would be to declare concentration as reentrant. That could be the default for Fortran 90. (The bulk of my experience is F77, and reentrant is not the default there.)