I have a basic program that I am trying to run to determine where MPI will place a task when the number of tasks is greater than the number of available processors (oversubscribing). If I run, for example, mpirun -np 4 <program name>, the result is:
processor 0 of 4
processor 1 of 4
processor 2 of 4
processor 3 of 4
But if I run the same command with 8 processes I get:
processor 1 of 8
processor 2 of 8
processor 5 of 8
processor 6 of 8
processor 4 of 8
processor 3 of 8
processor 7 of 8
processor 0 of 8
I understand that there are not 8 actual cores running my program; instead, multiple tasks are being run on the same processors. I want to know exactly how these are distributed. Thanks in advance.
Edit:
program test
! Similar to "Hello World" example - trying to determine rank/node placement
use mpi
implicit none
integer :: procid, ierr, numprocs, name_len
character(len=8) :: local   ! holds the OMPI_COMM_WORLD_LOCAL_RANK value
!character(len=MPI_MAX_PROCESSOR_NAME) :: name
! Open MPI exports the local rank of each process in this environment variable
call get_environment_variable('OMPI_COMM_WORLD_LOCAL_RANK', local)
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, procid, ierr)
!call MPI_GET_PROCESSOR_NAME(name, name_len, ierr)
print *, 'processor', procid, 'of', numprocs, 'on local node:', ' ', trim(local)
call MPI_FINALIZE(ierr)
end program test
Can you post your test program?
Is "processor x of y" coming from MPI_Comm_rank() and MPI_Comm_size()?
In that case, these numbers are MPI ranks and have nothing to do with binding.
You should rather read your MPI documentation and figure out how the binding is done.
Another option I sometimes use (with Open MPI) is
mpirun --tag-output grep Cpus_allowed_list /proc/self/status
Note there is a possibility that no binding is performed when you oversubscribe your nodes.
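For instance, with Open MPI, the --report-bindings option of mpirun prints where each rank is bound (this one-liner is my own addition, not part of the original comment):
$ mpirun --report-bindings -np 4 ./a.out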
Unfortunately, it is quite common in MPI to use the same words for different meanings.
For example, job managers tend to mix up the word processor and use it with different meanings. In this specific case, I will use the word processor to identify a CPU core or, in a hyperthreading-enabled scenario, a hardware thread (although a hardware thread is not exactly a CPU core).
This also applies to MPI_Get_processor_name. The standard does not actually require returning a name that uniquely identifies a processor. The name is left to the implementations, which usually report the name of the host. This is not what I assume you are looking for.
A regular process (be it MPI or not) is usually allowed to execute on several processors. This does not necessarily mean that the process will use ALL of these processors, but rather that it is able to bounce from one to another if the former is occupied by a different process (usually due to the operating system scheduler).
To obtain the process affinity (the list of processors that a process is allowed to use), you should use a different interface. For example, you can use something like sched_getaffinity (this is C, though). Some MPI implementations, such as Intel MPI, allow you to print the process affinity at MPI_Init by setting an environment variable.
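As a rough illustration, here is a minimal Fortran sketch of my own (Linux only, not from the answer above) that calls glibc's sched_getaffinity through ISO_C_BINDING; it assumes at most 64 logical CPUs so that a 64-bit integer can stand in for cpu_set_t:
program show_affinity
   use iso_c_binding
   implicit none
   interface
      ! int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
      integer(c_int) function sched_getaffinity(pid, cpusetsize, mask) &
            bind(c, name='sched_getaffinity')
         import :: c_int, c_size_t, c_int64_t
         integer(c_int),     value       :: pid
         integer(c_size_t),  value       :: cpusetsize
         integer(c_int64_t), intent(out) :: mask
      end function sched_getaffinity
   end interface
   integer(c_int64_t) :: mask
   integer :: i
   ! pid 0 queries the calling process itself
   if (sched_getaffinity(0_c_int, c_sizeof(mask), mask) /= 0) &
      stop 'sched_getaffinity failed'
   do i = 0, 63
      if (btest(mask, i)) print '(a,i0)', 'allowed on cpu ', i
   end do
end program show_affinity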
I would consider using existing programs that report the affinity. Check this page in the MPICH documentation.
Also included in the MPICH source is a program for printing out the affinity of a process (src/pm/hydra/examples/print_cpus_allowed.c) according to the OS. This can be used on Linux systems to test that bindings are working correctly.
shell$ mpiexec -n 8 -bind-to socket ./print_cpus_allowed | sort
crush[0]: Cpus_allowed_list: 0,2,4,6
crush[1]: Cpus_allowed_list: 1,3,5,7
crush[2]: Cpus_allowed_list: 0,2,4,6
crush[3]: Cpus_allowed_list: 1,3,5,7
crush[4]: Cpus_allowed_list: 0,2,4,6
crush[5]: Cpus_allowed_list: 1,3,5,7
crush[6]: Cpus_allowed_list: 0,2,4,6
crush[7]: Cpus_allowed_list: 1,3,5,7
My question is probably stupid, but I'm going to ask it anyway to be sure!
Question: Do you expect the two codes below to behave the same, using MPI_Comm_split to build one sub-communicator? (For example, let's say I'm running the code with 6 processes, with ranks between 0 and 5.)
NB: The code is in Fortran 90 with the Intel 2019 compiler, and I use MPICH for MPI.
CODE 1
call Mpi_Init(ierror)
call Mpi_Comm_Rank(mpi_comm_world,rank,ierror)
if (rank > 2) then
call Mpi_Comm_Split(mpi_comm_world,0,rank,new_comm,ierror)
else
call Mpi_Comm_Split(mpi_comm_world,mpi_undefined,rank,new_comm,ierror)
endif
CODE 2
call Mpi_Init(ierror)
call Mpi_Comm_Rank(mpi_comm_world,rank,ierror)
if (rank > 2) then
color = 0
else
color = mpi_undefined
endif
call Mpi_Comm_Split(mpi_comm_world,color,rank,new_comm,ierror)
MPI_Comm_split is not called from the same place in the two codes, but to me it should behave the same; I'm not sure, though... I read that MPI_Comm_split has to be invoked on the same line, but how can processes know whether the call to MPI_Comm_split is made on one line or another? (It doesn't make any sense to me!)
NB: I tested both with MPICH and Intel Fortran, and both implementations of the communicator splitting work, but I'm afraid of the behavior with other MPI implementations...
Assuming you declared color correctly, both codes are equivalent.
MPI_Comm_split() is a collective operation, and hence must be invoked by all the ranks of the parent communicator. That does not mean the call has to be made from the same line of code.
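For completeness, here is a minimal sketch (my own addition, reusing the variable names from the question plus an assumed new_rank) of what each rank should do after either version of the split, since ranks that pass MPI_UNDEFINED get MPI_COMM_NULL back:
! Ranks 0-2 passed mpi_undefined, so for them new_comm is MPI_COMM_NULL
! and must not be used for communication.
if (new_comm == MPI_COMM_NULL) then
   print *, 'rank', rank, 'is not a member of the sub-communicator'
else
   call Mpi_Comm_Rank(new_comm, new_rank, ierror)
   print *, 'rank', rank, 'has rank', new_rank, 'in new_comm'
endif
call Mpi_Finalize(ierror)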
I am new to Fortran and I would like to ask for help. My code is very simple. It just enters a loop and then, using the SYSTEM intrinsic procedure, changes into the directory named code and runs the evalcode.x program.
program subr1
implicit none
integer :: i
real :: T1,T2
call cpu_time(T1)
do i=1,6320
call system ("cd ~/code; ../evalcede/source/evalcode.x test ")
enddo
call cpu_time(T2)
print *, T1,T2
end program subr1
The time measured while the program is actually running is 0.5 s, but the time this code actually needs to execute is 1.5 hours! The program is suspended or waiting, and I do not know why.
Note: this is more of an elaborated comment on janneb's answer, to provide a bit more information.
As indicated by janneb, the function CPU_TIME does not necessarily return the wall-clock time, which is what you are after. This is especially true when timing system calls.
Furthermore, the output of CPU_TIME is really a processor- and compiler-dependent value. To demonstrate this, the following code is compiled with gfortran, ifort, and solaris-studio f90:
program test_cpu_time
real :: T1,T2
call cpu_time(T1)
call execute_command_line("sleep 5")
call cpu_time(T2)
print *, T1,T2, T2-T1
end program test_cpu_time
#gfortran>] 1.68200000E-03 1.79799995E-03 1.15999952E-04
#ifort >] 1.1980000E-03 1.3410000E-03 1.4299992E-04
#f90 >] 0.0E+0 5.00534 5.00534
Here, you see that both gfortran and ifort exclude the time of the system command, while solaris-studio includes it.
In general, one should see the difference between the outputs of two consecutive calls to CPU_TIME as the time spent by the CPU to perform the actions. Due to the system call, the process is actually in a sleep state during the time of execution, and thus no CPU time is spent. This can be seen with a simple ps:
$ ps -O ppid,nlwp,psr,stat $(pgrep sleep) $(pgrep a.out)
PID PPID NLWP PSR STAT S TTY TIME COMMAND
27677 17146 1 2 SN+ S pts/40 00:00:00 ./a.out
27678 27677 1 1 SN+ S pts/40 00:00:00 sleep 5
NLWP indicates how many threads are in use
PPID indicates the parent PID
STAT shows 'S' for interruptible sleep (waiting for an event to complete)
PSR is the CPU/thread the process is running on
You notice that the main program a.out is in a sleep state, and that the system call and the main program run on separate cores. Since the main program is sleeping, CPU_TIME will not clock this time.
note: solaris-studio is the odd duck, but then again, it's solaris studio!
General comment: CPU_TIME is still useful for determining the execution time of segments of code. It is just not useful for timing external programs. Other, more dedicated tools exist for this, such as time; the OP's program could be reduced to the bash command:
$ time ( for i in $(seq 1 6320); do blabla; done )
This is what the standard has to say on CPU_TIME(TIME):
CPU_TIME(TIME)
Description: Return the processor time.
Note 13.9: A processor for which a single result is inadequate (for example, a parallel processor) might choose to provide an additional version for which time is an array.
The exact definition of time is left imprecise because of the variability in what different processors are able to provide. The primary purpose is to compare different algorithms on the same processor or discover which parts of a calculation are the most expensive.
The start time is left imprecise because the purpose is to time sections of code, as in the example.
Most computer systems have multiple concepts of time. One common concept is that of time expended by the processor for a given program. This might or might not include system overhead, and has no obvious connection to elapsed “wall clock” time.
source: Fortran 2008 Standard, Section 13.7.42
On top of that:
It is processor dependent whether the results returned from CPU_TIME, DATE_AND_TIME and SYSTEM_CLOCK are dependent on which image calls them.
Note 13.8: For example, it is unspecified whether CPU_TIME returns a per-image or per-program value, whether all images run in the same time zone, and whether the initial count, count rate, and maximum in SYSTEM_CLOCK are the same for all images.
source: Fortran 2008 Standard, Section 13.5
The CPU_TIME intrinsic measures the CPU time consumed by the program itself, not including that of its subprocesses (1).
Apparently most of the time is spent in evalcode.x, which explains why the wall-clock time is much higher than the reported CPU time.
If you want to measure wallclock time intervals in Fortran, you can use the SYSTEM_CLOCK intrinsic.
(1) Well, that's what GFortran does, at least. The standard doesn't specify exactly what it means.
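A minimal sketch (my own illustration, with execute_command_line standing in for the OP's external program) of wall-clock timing with SYSTEM_CLOCK:
program walltime
   implicit none
   integer :: t1, t2, rate
   real :: c1, c2
   call system_clock(count_rate=rate)   ! ticks per second
   call cpu_time(c1)
   call system_clock(t1)
   call execute_command_line("sleep 5") ! stand-in for evalcode.x
   call system_clock(t2)
   call cpu_time(c2)
   print *, 'wall clock:', real(t2-t1)/real(rate), 's   cpu:', c2-c1, 's'
end program walltime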
I have a parallel fortran code in which I want only the rank=0 process to be able to write to stdout, but I don't want to have to litter the code with:
if(rank==0) write(*,*) ...
so I was wondering if doing something like the following would be a good idea, or whether there is a better way?
program test
use mpi
implicit none
integer :: ierr
integer :: nproc
integer :: rank
integer :: stdout
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, rank, ierr)
call mpi_comm_size(mpi_comm_world, nproc, ierr)
select case(rank)
case(0)
stdout = 6
case default
stdout = 7
open(unit=stdout, file='/dev/null')
end select
write(stdout,*) "Hello from rank=", rank
call mpi_finalize(ierr)
end program test
This gives:
$ mpirun -n 10 ./a.out
Hello from rank= 0
Thanks for any advice!
There are two disadvantages to your solution:
This "clever" solution actually obscures the code, since it lies: stdout isn't stdout any more. If someone reads the code he/she will think that all processes are writing to stdout, while in reality they aren't.
If you want all processes to write to stdout at some point, what will you do then? Add more tricks?
If you really want to stick with this trick, please don't use "stdout" as the variable for the unit number, but e.g. "master" or anything that indicates you're not actually writing to stdout. Furthermore, you should be aware that unit 6 isn't always stdout. Fortran 2003 allows you to query the unit number of stdout, so use that if you can.
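For reference, a minimal sketch (my addition) of the Fortran 2003 route via the intrinsic ISO_FORTRAN_ENV module:
! OUTPUT_UNIT is the standard-defined unit number connected to stdout
! (often, but not necessarily, 6).
use iso_fortran_env, only: output_unit
write(output_unit, *) "Hello from rank=", rank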
My advice would be to stay with the if(rank==0) statements. They clearly indicate what happens in the code. If you use lots of similar I/O statements, you could write subroutines for writing only on rank 0, or on all processes; these can have meaningful names that indicate the intended usage.
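Such a subroutine could look like this minimal sketch (my own example; the name and interface are assumptions):
! Hypothetical helper: only rank 0 actually writes the message.
subroutine master_write(rank, msg)
   implicit none
   integer, intent(in) :: rank
   character(len=*), intent(in) :: msg
   if (rank == 0) write(*,*) trim(msg)
end subroutine master_write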
mpirun comes with an option to redirect stdout from each process into separate files. For example, -output-filename out would result in out.1.0, out.1.1, ..., which you can then monitor in whatever way you like (I use tail -f). Next to if(rank.eq.0), this is the cleanest solution, I think.
I am not so concerned about the two disadvantages mentioned by steabert. We can work around them by introducing another unit variable that clearly indicates it is stdout only on the master process, e.g. stdout -> stdout0.
But here is my concern: /dev/null will work in a UNIX-like environment, but will it work on Windows? How about the funky BlueGene systems?
I have a FORTRAN program written for parallel computing. The program takes arguments, and the number of threads can be defined as an argument. The sample code is as follows:
COUNT = NARGS()
NTHREADS = 1
! *** GET THE COMMAND LINE ARGUMENTS, IF ANY
IF(COUNT.GT.1)THEN
! *** ARGUMENT 1
CALL GETARG(1, BUFFER, iStatus)
IF (Buffer(1:4).EQ.'-NOP'.OR.Buffer(1:4).EQ.'-nop') THEN
PAUSEIT=.FALSE.
ENDIF
IF (Buffer(1:3).EQ.'-NT'.OR.Buffer(1:3).EQ.'-nt') THEN
READ(Buffer(4:10),*) NTHREADS
ENDIF
IF(COUNT.GT.2)THEN
! *** ARGUMENT 2
CALL GETARG(2, BUFFER, iStatus)
IF (Buffer(1:4).EQ.'-NOP'.OR.Buffer(1:4).EQ.'-nop') THEN
PAUSEIT=.FALSE.
ENDIF
IF (Buffer(1:3).EQ.'-NT'.OR.Buffer(1:3).EQ.'-nt') THEN
READ(Buffer(4:10),*) NTHREADS
ENDIF
ENDIF
ENDIF
Let's say my compiled file name is "hellofortran". I can define the number of threads as
./hellofortran -nt4
My program will then run with 4 threads. The problem is that I can request any number of threads on any computer. Let's say I have a dual-core processor: I have only two cores, but I can still run with 6-8 threads or any other number. How can I properly define the number of threads in this particular case?
I hope I explained my problem. Looking forward to hearing how I can improve my program. Thanks.
Jdbaba
If you're using OpenMP and just looking to set up how many threads to use, I would just specify the number of threads in the environment:
export OMP_NUM_THREADS=4
./hellofortran
and write your OpenMP code as you would normally. There are programmatic ways of setting thread counts but this is likely more straightforward for you.
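If you do want the programmatic route (a sketch assuming the code uses OpenMP and that NTHREADS holds the parsed -nt value), you can also clamp the request to what the machine actually has:
! Hypothetical fragment: cap the requested thread count at the number of
! logical processors the OpenMP runtime reports, then apply it.
use omp_lib
nthreads = min(nthreads, omp_get_num_procs())
call omp_set_num_threads(nthreads)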