MPI_Allreduce failed in MPICH2


I have recently started working with MPI and am still very new to it. I ran into a problem while using MPICH2. Below is a small Fortran 90 program, modified from a Hello World program. I haven't tested a C version, but I expect it would be very similar (differing only in the function names and the error parameter).
I am working on Windows 7 64-bit with MinGW (gcc version 4.6.2, a 32-bit compiler) and the 32-bit version of MPICH2 1.4.1-p1. Here is the command I used to compile the code:
gfortran hello1.f90 -g -o hello.exe -IC:\MPICH2_x86\include -LC:\MPICH2_x86\lib -lfmpich2g
And here is the simple code:
program main
include 'mpif.h'
character * (MPI_MAX_PROCESSOR_NAME) processor_name
integer myid, numprocs, namelen, rc,ierr
integer, allocatable :: mat1(:, :, :)
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
call MPI_GET_PROCESSOR_NAME(processor_name, namelen, ierr)
allocate(mat1(-36:36, -36:36, -36:36))
mat1(:,:,:) = 0
call MPI_Bcast( mat1(-36, -36, -36), 389017, MPI_INT, 0, MPI_COMM_WORLD, ierr )
call MPI_Allreduce(MPI_IN_PLACE, mat1(-36, -36, -36), 389017, MPI_INTEGER, MPI_BOR, MPI_COMM_WORLD, ierr)
print *,"MPI_Allreduce done!!!"
print *,"Hello World! Process ", myid, " of ", numprocs, " on ", processor_name
call MPI_FINALIZE(rc)
end
It compiles, but it fails at run time (maybe an invalid memory access?). The problem must involve MPI_Allreduce, since the program works fine if I remove that line. It also works if I make the matrix smaller. I also tried it on an Ubuntu machine with the same MPI version, and there was no problem on Linux.
When I check it with gdb (which comes with MinGW; gdb hello.exe, then backtrace), I get something that seems meaningless, at least to me:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 16316.0x4fd0]
0x01c03100 in mpich2nemesis!PMPI_Wtime ()
from C:\Windows\system32\mpich2nemesis.dll
(gdb) backtrace
#0 0x01c03100 in mpich2nemesis!PMPI_Wtime ()
from C:\Windows\system32\mpich2nemesis.dll
#1 0x0017be00 in ?? ()
#2 0x00000000 in ?? ()
Does this mean there is something wrong with the Windows version of the MPI library?
What would be the solution to make it work?
Thanks.

This might not fix your problem, but MPI_INT is not a Fortran MPI datatype; it is the C datatype. MPI_INTEGER is the corresponding Fortran datatype. Some implementations may provide MPI_INT on the Fortran side, but I'm pretty sure the standard does not define it. Try compiling your code with IMPLICIT NONE and see whether the compiler complains (also test whether MPI_INTEGER .ne. MPI_INT). If it complains, then MPI_INT is an implicitly typed variable that ends up with some arbitrary value (or your version of MPI uses MPI_INT for some other datatype...). That value may coincide with one of the predefined handles set by MPI, in which case your array of integers is treated as some other type. This could result in a buffer overflow, which can manifest itself in all kinds of funny ways.
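The suggestion above can be sketched as follows. This is a minimal variant of the question's program, assuming the undeclared MPI_INT is indeed the culprit; it has not been run on the Windows setup described in the question. The names nside and nelem are introduced here for illustration only.

```fortran
program main
  implicit none            ! undeclared names such as MPI_INT now fail at compile time
  include 'mpif.h'
  integer, parameter :: nside = 73                   ! extent of -36:36 in each dimension
  integer, parameter :: nelem = nside*nside*nside    ! 73**3 = 389017 elements
  integer :: myid, numprocs, ierr
  integer, allocatable :: mat1(:, :, :)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

  allocate(mat1(-36:36, -36:36, -36:36))
  mat1 = 0

  ! MPI_INTEGER is the Fortran datatype that matches a default INTEGER
  call MPI_BCAST(mat1, nelem, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  call MPI_ALLREDUCE(MPI_IN_PLACE, mat1, nelem, MPI_INTEGER, MPI_BOR, &
                     MPI_COMM_WORLD, ierr)

  print *, "Process ", myid, " of ", numprocs, ": MPI_Allreduce done"
  call MPI_FINALIZE(ierr)
end program main
```

Computing the element count from the array extent, rather than typing 389017 twice, keeps both calls consistent if the bounds change later.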

Related

"Segmentation fault - invalid memory reference" possibly related to number types problem. How do I change my number types?

I am trying to use an old Fortran code for processing data. I have little experience with Fortran and have been unable to get past a problem that I think is to do with number types.
Part of the code I am using is at the bottom of this question. I am pretty sure, but not certain, that the second-last line (also the second-last line of this post) is the problem.
First I did this:
gfortran -g cpt_ir_.f90 -o cpt_ir_.o
./cpt_ir_.o < di.in
It resulted in this error:
Backtrace for this error:
#0 0x2B794A134467
#1 0x2B794A134AAE
#2 0x2B794ABC724F
#3 0x2B794A1FB8AB
#4 0x2B794A1F7613
#5 0x2B794A1F934E
#6 0x2B794A1FDF86
#7 0x40128A in MAIN__ at cpt_ir_.f90:29
Segmentation fault
I searched Stack Overflow and saw a suggestion to do the following to get more information:
gfortran -g -fcheck=all -Wall cpt_ir_.f90
The output is shown directly below. The Fmax... line is the final line of the code I pasted at the end of this post (further in the code there are other similar lines). However, I see that it is reported as a warning, not an error. So although I proceed here as though it is the error, there may be another problem that the command above did not reveal.
cpt_ir_.f90:66.5:
Fmax=N*((4000.0/(2.0*Pi))*(2.0*Pi*timestep*1.0-15.0*29979245800.0))
1
Warning: Possible change of value in conversion from REAL(8) to INTEGER(4) at (1)
I came across a suggestion here at Stack Overflow to use the following flags:
-fdefault-integer-8 -fno-range-check
Which I did as follows:
gfortran -g -fdefault-integer-8 -fno-range-check cpt_ir_.f90 -o cpt_ir_.o
I'm not sure if I did it correctly. I also tried them one by one. Anyway, there was no change and I got the same error. I also tried manually changing the numbers in the problem line as shown in the final line of this post. That didn't help either--I got an error that the largest number was too large for an int.
If anyone could please point me in the right direction, I would be very grateful. Please also feel free to change my tags if there are more appropriate tags for this question.
cpt_ir_.f90:
!
!
IMPLICIT NONE
INTEGER, PARAMETER :: dp=KIND(0.0D0)
REAL(KIND=dp), DIMENSION(:), ALLOCATABLE :: correlation
REAL(KIND=dp) :: integral,omega,Pi,timestep
REAL(KIND=dp), DIMENSION(3,1000000) :: dipder,dip
REAL(KIND=dp), DIMENSION(3) :: m_vec
INTEGER :: N,I,J,Nmax,Fmax
CHARACTER(LEN=100) :: line,filename
Pi=4.0D0*ATAN(1.0D0)
READ(5,*) filename
READ(5,*) timestep
OPEN(10,FILE=filename)
N=0
DO
READ(10,'(A100)',END=999) line
IF (INDEX(line,' XXX').NE.0) THEN
N=N+1
READ(line(45:),*) dip(:,N)
ENDIF
IF (INDEX(line,' XXX').NE.0) THEN
N=N+1
READ(line(45:),*) dipder(:,N)
ENDIF
ENDDO
999 CONTINUE
CLOSE(10)
Nmax=N/10
print *, Nmax
ALLOCATE(correlation(0:Nmax))
correlation=0.0_dp
DO I=1,N-Nmax
DO J=I,I+Nmax
correlation(J-I)=correlation(J-I)+DOT_PRODUCT(dipder(:,I),dipder(:,J))
ENDDO
ENDDO
DO I=0,Nmax
correlation(I)=correlation(I)/(REAL(N-I,kind=dp)*REAL(N,kind=dp))
ENDDO
OPEN(UNIT=10,FILE="dip_dip_correlation.time")
write(10,*) "XXX"
DO I=-Nmax,Nmax
write(10,*) I*timestep,correlation(ABS(I))/correlation(0)
ENDDO
CLOSE(10)
OPEN(UNIT=10,FILE="XXX")
write(10,*) "XXX"
!Fmax up to 4000 cm^-1
Fmax=N*((4000D0/(2*Pi))*(2.0D0*Pi*timestep*1.0D-15*29979245800.0))
! My try: Fmax=N*((4000.0/(2.0*Pi))*(2.0*Pi*timestep*1.0-15.0*29979245800.0))
Update based on Dan's answer:
Dan kindly pointed out that I needed to uncomment "N=N+1." Unfortunately, after fixing that, I am still seeing the segmentation fault. Just now when I ran:
gfortran -g -fcheck=all -Wall cpt_ir_.f90
on my try at the last line of the code (where I tried converting everything to a float):
Fmax=N*((4000.0/(2.0*Pi))*(2.0*Pi*timestep*1.0-15.0*29979245800.0))
I got:
Fmax=N*((4000.0/(2.0*Pi))*(2.0*Pi*timestep*1.0-15.0*29979245800.0))
1
Warning: Possible change of value in conversion from REAL(8) to INTEGER(4) at (1)

At a glance, your line
READ(line(45:),*) dip(:,N)
is the first problem. You commented out the N=N+1 line, so N = 0. Fortran is 1-indexed, meaning that arrays start at index 1 unless specified otherwise. So the second dimension of dip starts at 1, and you are trying to set the 'zeroth' element, which does not exist.
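To illustrate the indexing point, here is a small self-contained sketch. The array size maxrec and the loop are hypothetical stand-ins for the file-reading loop in the question; only the pattern of incrementing N before storing is taken from the original code.

```fortran
program index_demo
  implicit none
  integer, parameter :: dp = kind(0.0d0)
  integer, parameter :: maxrec = 5      ! hypothetical capacity, far smaller than the original
  real(kind=dp) :: dip(3, maxrec)
  integer :: n, k

  ! Default lower bound is 1, so dip(:,0) does not exist
  print *, 'second dimension runs from', lbound(dip, 2), 'to', ubound(dip, 2)

  n = 0
  do k = 1, 3
     ! Advance the counter first, so the first record lands in column 1
     n = n + 1
     if (n > ubound(dip, 2)) stop 'dip is full'
     dip(:, n) = real(k, kind=dp)
  end do
  print *, 'records stored:', n
end program index_demo
```

Running the program built with gfortran's -fcheck=bounds (included in -fcheck=all) should report an out-of-range index such as dip(:,0) at run time instead of crashing with a bare segmentation fault.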

Problem when reading Fortran file with CodeBlocks IDE

Recently, I began learning the Fortran programming language.
I am using the CodeBlocks IDE with the GNU Fortran compiler.
I have a problem with a simple program that I found in an online Fortran course that explains how to read from and write to a file.
The program is the following:
program main
implicit none
character (len=14) :: c1,c2,c3
integer :: n
real :: T
open(unit=10,file='titi.txt')
read(10,*) c1,n,c2
read(10,*) c3,T
close(10)
open(unit=20,file='toto.txt')
write(20,*) c1,'il est',n,c2
write(20,*)'la',c3,'est de',T,'degres'
close(20)
end
Where the file 'titi.txt' contains:
bonjour 4 heures
temperature 37.2
The error message that appears in the console is the following:
Program received signal SIGSEGV: Segmentation fault - invalid memory
reference.
Backtrace for this error:
#0 ffffffff
I tried using the flag
-g
and then I found with the debugger that the problem is in the first line where 'read' is used:
read(10,*) c1,n,c2
I really don't know how to deal with this. The code seems pretty simple to me, and I have never seen this error message before, so I don't know what it means.
Thanks for your answers in advance.

Thank you all for your responses.
The problem was actually caused by an old compiler. Once I downloaded the latest version, everything worked perfectly without changing a single line of the code.

This is not an answer, but it's too much text for a comment.
It's running fine on my computer.
Can you compile it with
gfortran -g -O0 -fbacktrace -Wall -fcheck=all
That way you should get a lot more information. Also, you can add some error checking:
Add the following variables:
integer :: ios
character(len=100) :: iomsg
Then you can add error checking to all io statements like this:
read(10,*) c1,n,c2
becomes:
read(10,*,iostat=ios,iomsg=iomsg) c1,n,c2
if (ios /= 0) then
print*, "Error reading c1, n, c2:"
print*, trim(iomsg)
STOP
end if
That can also give you some hints.

Fortran inquire return internal file

I have an unexpected situation with my Fortran program.
I want to write some data to a text file. For various reasons, I created a type to handle this file.
I therefore open the file with the following statement (where f is my file type object):
open(newunit = f%unit, &
file = trim(adjustl(f%name)), &
form = 'FORMATTED', &
access = 'STREAM', &
action = 'WRITE', &
status = 'REPLACE', &
iostat = ios)
if(ios /= 0) print '("Problem creating file")', trim(f%name)
Then I write stuff like this:
write(unit=f%unit,fmt='(100A)') "#Header"
And when it is done, I want to close the file. To do this, I call the subroutine:
subroutine close_file_ascii(f)
implicit none
class(my_file_type), intent(in) :: f
logical :: file_opened, file_exist
integer :: ios
inquire(unit=f%unit, exist=file_exist, opened=file_opened) !, iostat=ios)
print *,'file_exist:', file_exist, 'file_opened:', file_opened, 'ios:', ios
if ((file_opened) .and. (file_exist)) then
close(unit=f%unit)
else
print *,"[WARNING]"
end if
end subroutine close_file_ascii
My problem is in this last subroutine. When I run the program on windows, I get the following error:
Fortran runtime error: Inquire statement identifies an internal file
Error termination. Backtrace
I therefore tried to create an MWE to understand the problem, but all of my attempts worked fine, so I couldn't isolate the problem. Another strange thing is that there is no problem when I compile and run with gfortran on my Linux machine, but on Windows I get the error above. (On Windows I compile with gfortran version 7.3.0 x86_64-posix-sjlj-rev0, built by MinGW-W64.)
I have already worked around the problem by uncommenting the end of the inquire line in the close subroutine (restoring iostat=ios), and everything seems to work fine. I then get the following output:
file_exist: T file_opened: T ios: 5018
But I would like to understand what is going on. What could cause such an internal-file error (when the file should be external, not internal)? Could my workaround be harmful? Is it a bug? Is there a better way to safely close an open file? Any ideas for isolating the problem?
EDIT
From roygvib's comment, the following test seems to replicate the problem:
program bug
implicit none
integer :: i
character(len=1) :: s
write (s,'(i1)') 0
open(newUnit=i,file='bug.txt',status='unknown')
inquire(unit=i)
end program bug

The update to gfortran 8.1 solved the problem.
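On the question of closing a file safely, one possible pattern is sketched below. It does not address the compiler issue itself, and the file name demo.txt is made up for the example. The idea is to ask only whether the unit is connected and to capture error codes with iostat, so that a failing inquire or close is reported rather than terminating the program.

```fortran
program close_demo
  implicit none
  integer :: u, ios
  logical :: is_open

  open(newunit=u, file='demo.txt', status='replace', action='write', iostat=ios)
  if (ios /= 0) stop 'could not create demo.txt'
  write(u, '(A)') '#Header'

  ! Ask only whether the unit is connected, and capture any error code
  inquire(unit=u, opened=is_open, iostat=ios)
  if (ios /= 0) then
     ! is_open is not reliable here, so do not test it
     print *, 'inquire failed, iostat =', ios
  else if (is_open) then
     close(unit=u, iostat=ios)
     if (ios /= 0) print *, 'close failed, iostat =', ios
  else
     print *, '[WARNING] unit was not open'
  end if
end program close_demo
```

Note that when inquiring by unit, exist= reports whether the unit exists, not whether the file exists on disk, so it adds little to a check made just before close.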

why move_alloc is not working in gfortran (4.6.3), but it is in ifort?

I've read here that move_alloc has been supported since gfortran 4.2. I have gfortran 4.6 installed on Ubuntu 12.04, but move_alloc is not working!
move_alloc is used five times inside a loop that runs 10 times.
After compiling with gfortran (without any error or warning), the program runs only one step of the loop (I added some prints to check for mistakes) and then shows "segmentation fault (core dumped)". However, when ifort is used, the program runs and works correctly.
I have also tried gfortran 4.4.6 on CentOS.
Both computers are x86_64.
Other important information: this piece of code is in a subroutine inside a module, since I don't know in advance the size of the vectors that are allocated by move_alloc. All these vectors have the intent(out) attribute in the subroutine. xray_all, yray_all and elem_all are double precision and the others are integer. The main program and the module are in different files.
Here is the piece of the code where I use move_alloc:
program main
double precision,allocatable,dimension(:)::xrayall,yrayall
(...)other allocatable variables
call yyyy(....,ray_indent,xray_all,...)
end program main
module xxxx
subroutine yyyy
do j=1,10
<lots of calculation>
allocate( vec_aux( 1:(i+size(ray_indent) ) ) )
vec_aux(1:size(ray_indent))=ray_indent
vec_aux(size(ray_indent)+1:)=j
call MOVE_ALLOC(vec_aux,ray_indent)
allocate( vec_auxreal( 1:(i+size(xray_all) ) ) )
vec_auxreal(1:size(xray_all))=xray_all
vec_auxreal(size(xray_all)+1:)=xray
call MOVE_ALLOC(vec_auxreal,xray_all)
allocate( vec_auxreal( 1:(i+size(yray_all) ) ) )
vec_auxreal(1:size(yray_all))=yray_all
vec_auxreal(size(yray_all)+1:)=yray
call MOVE_ALLOC(vec_auxreal,yray_all)
elemsize=count(icol/=0);
allocate( vec_auxreal( 1:(elemsize+size(elem_all) ) ) )
vec_auxreal(1:size(elem_all))=elem_all
vec_auxreal(size(elem_all)+1:)=elem(1:elemsize)
call MOVE_ALLOC(vec_auxreal,elem_all)
allocate( vec_aux( 1:(elemsize+size(icol_all) ) ) )
vec_aux(1:size(icol_all))=icol_all
vec_aux(size(icol_all)+1:)=icol(1:elemsize)
call MOVE_ALLOC(vec_aux,icol_all)
allocate( vec_aux( 1:(elemsize+size(irow_all) ) ) )
vec_aux(1:size(irow_all))=irow_all
vec_aux(size(irow_all)+1:)=j+control !
call MOVE_ALLOC(vec_aux,irow_all)
end do
end subroutine yyyy
end module xxxx

I have found the solution! With gfortran it is necessary to add the following if statement to all five blocks:
allocate( vec_auxreal( 1:(elemsize+size(elem_all) ) ) )
if (j/=1) vec_auxreal(1:size(elem_all))=elem_all
vec_auxreal(size(elem_all)+1:)=elem(1:elemsize)
call MOVE_ALLOC(vec_auxreal,elem_all)
This is needed because on the first iteration the vector is still empty, and gfortran does not treat the copy from the empty vector as a no-op. With ifort (tested with version 12.0), the program works without this if statement.
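A likely reason for the first-iteration failure is that an allocatable dummy argument with intent(out) is deallocated on entry to the subroutine, so the vectors are unallocated (not merely empty) the first time through the loop, and referencing an unallocated array is not allowed. The sketch below shows one way to guard against that with allocated(); the helper append and the test values are hypothetical and not part of the original code.

```fortran
program append_demo
  implicit none
  integer, allocatable :: v(:)
  integer :: j

  do j = 1, 3
     call append(v, [j, 10*j])
  end do
  print *, v      ! the six appended values, in order

contains

  ! Append new_vals to vec; vec may be unallocated on the first call.
  subroutine append(vec, new_vals)
    integer, allocatable, intent(inout) :: vec(:)
    integer, intent(in) :: new_vals(:)
    integer, allocatable :: tmp(:)
    integer :: n_old

    n_old = 0
    if (allocated(vec)) n_old = size(vec)   ! never call size() on an unallocated array

    allocate(tmp(n_old + size(new_vals)))
    if (n_old > 0) tmp(1:n_old) = vec
    tmp(n_old+1:) = new_vals
    call move_alloc(tmp, vec)               ! tmp becomes unallocated, vec takes over its storage
  end subroutine append
end program append_demo
```

Wrapping the pattern in one routine also avoids repeating the allocate/copy/move_alloc sequence five times per iteration.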

Does a program started with mpiexec know it was started with mpiexec?

I'm in the process of adding an option to a Fortran program to run it using multiple processors using MPI. If the user is going to run it in parallel, the user needs to specify different input files---one file for each domain (processor) of the problem. The program will look for a specific filename by default (a file called "serial.inp"). So I need the program to know when it is being run in parallel so that it can instead look for the other filenames instead (e.g. "parallel_1.inp", "parallel_2.inp", "parallel_3.inp", etc.). My first thought is to have the user pass an argument to the program when they execute it, e.g.:
mpiexec -n 4 myprogram.exe -parallel
This way, it will look for the parallel files when that argument is present. But it seems kind of redundant. If the program is being called with mpiexec, there is no question that the user is attempting to run it in parallel. Is there any way that my program will know it was started using mpiexec? Or is the command line argument my best bet?

Alex Leach is right that you can do this by looking up MPI-implementation-specific environment variables, but there's no portable way to do it.
As I understand it, though, you don't really need to; you can get most of what you want by just checking whether the program was run with one rank:
program filenames
use mpi
implicit none
integer :: comsize, rank, ierr
character(len=128) :: inputfilename
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD,comsize,ierr)
call MPI_Comm_rank(MPI_COMM_WORLD,rank,ierr)
if (comsize == 1) then
inputfilename = 'serial.inp'
else
write(inputfilename, '(A,I0,A)') 'parallel_',rank,'.inp'
endif
write(*,'(I0,1X,A)') rank, trim(inputfilename)
call MPI_Finalize(ierr)
end program filenames
Running gives
$ mpirun -np 4 ./filenames
0 parallel_0.inp
1 parallel_1.inp
2 parallel_2.inp
3 parallel_3.inp
$ ./filenames
0 serial.inp
That's not perfect; it'll give the serial result if you run using mpirun -np 1 filenames, but depending on your use case that may not be a terrible thing in exchange for having something portable.

Processes run with mpiexec will have various environment variables set, indicating to the subprocesses whether they are the master process or slaves, amongst other things.
Look in your mpiexec's documentation for specific details. Microsoft have some documentation online too.
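As a rough illustration of the environment-variable approach, the sketch below checks two variable names that common launchers are known to set: PMI_RANK (MPICH-style process managers) and OMPI_COMM_WORLD_RANK (Open MPI). Treat both names as assumptions to verify against your own mpiexec's documentation; as noted in the other answer, this is not portable.

```fortran
program detect_launcher
  implicit none
  ! Candidate variable names are assumptions; check your launcher's documentation.
  character(len=*), parameter :: candidates(2) = &
       [character(len=20) :: 'PMI_RANK', 'OMPI_COMM_WORLD_RANK']
  character(len=32) :: value
  integer :: i, length, status
  logical :: launched

  launched = .false.
  do i = 1, size(candidates)
     call get_environment_variable(trim(candidates(i)), value, length, status)
     ! status == 0: variable is set; status == -1: set but value was truncated
     if ((status == 0 .or. status == -1) .and. length > 0) then
        launched = .true.
        exit
     end if
  end do

  if (launched) then
     ! The launcher's rank is zero-based
     print '(3a)', 'parallel_', trim(value), '.inp'
  else
     print '(a)', 'serial.inp'
  end if
end program detect_launcher
```

Started directly from a shell with neither variable set, the program falls through to the serial file name, which mirrors the default behaviour described in the question.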

Why not do it programmatically? This is how I do it in my program:
#ifdef MPI
CALL MPI_Init(ierr) ! Initialize MPI
CALL MPI_Comm_rank(mpicomm,nproc,ierr) ! Who am I?
CALL MPI_Comm_size(mpicomm,size,ierr) ! How many processes?
#else
nproc = 0
size = 1
#endif
After this point in the program, you can inquire whether the program is serial or parallel by inquiring the value of size.
