MPI_Comm_spawn called multiple times - C++

We are writing a code to solve a nonlinear problem with an iterative (Newton) method. The problem is that we don't know a priori how many MPI processes will be needed from one iteration to the next, due to e.g. remeshing, adaptivity, etc. And there are quite a lot of iterations...
We would therefore like to use MPI_Comm_spawn at each iteration to create as many MPI processes as we need, gather the results and "destroy" the subprocesses. We know this limits the scalability of the code due to the gathering of information; however, we have been asked to do it :)
I did a couple of tests of MPI_Comm_spawn on my laptop (Windows 7, 64-bit) using Intel MPI and Visual Studio Express 2013. I tried these simple codes:
//StackMain
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    for (int i = 0; i < 10000; i++)
    {
        std::cout << "Loop number " << i << std::endl;
        MPI_Comm children;
        std::vector<int> err(4);
        ierr = MPI_Comm_spawn("StackWorkers.exe", NULL, 4, MPI_INFO_NULL, 0,
                              MPI_COMM_WORLD, &children, &err[0]);
        MPI_Barrier(children);
        MPI_Comm_disconnect(&children);
    }
    ierr = MPI_Finalize();
    return 0;
}
And the program launched by the spawned processes:
//StackWorkers
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    MPI_Comm parent;
    ierr = MPI_Comm_get_parent(&parent);
    MPI_Barrier(parent);
    ierr = MPI_Finalize();
    return 0;
}
The program is launched using one MPI process:
mpiexec -np 1 StackMain.exe
It seems to work; I do, however, have some questions...
1- The program freezes during iteration 4096, and this number does not change if I relaunch the program. If during each iteration I spawn 4 processes twice, then it stops at iteration 2048...
Is this a limitation of the operating system?
2- When I look at the memory occupied by "mpiexec" during the run, it grows continuously (and never goes down). Do you know why? I thought that, when the subprocesses finished their job, they would release the memory they used...
3- Should I disconnect/free the children communicator or not? If yes, must MPI_Comm_disconnect(...) be called on both the spawning and the spawned processes, or only on the spawned ones?
Thanks a lot!

Related

How to stop and resume all processes by one process in MPI?

In MPI, what is the correct way for one process (call it process A) to stop, at a specific moment, all processes that are doing the application tasks, and then have process A resume them all? For example:
#include <stdio.h>
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int num_procs, my_rank, my_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    my_id = getpid();
    printf("Hello! I'm process with rank %i out of %i processes and my id is %i\n",
           my_rank, num_procs, my_id);
    // here I want process A stops all processes
    // ....
    usleep(3000000); // sleep for 3 seconds
    // here I want that process A resumes all processes
    printf("Bye! I'm process %i out of %i processes and my id is %i\n",
           my_rank, num_procs, my_id);
    MPI_Finalize();
    return 0;
}
I tried using kill(pid, SIGSTOP); and kill(pid, SIGCONT);, but the problem is that the id of each process, obtained through getpid(), is required so that process A can use those ids to stop and resume the processes. I tried to see whether I could obtain the process ids and let one process access them all. So, I modified the above code as follows.
#include <stdio.h>
#include <iostream>
#include <mpi.h>
#include <unistd.h>
#include <vector>

std::vector<int> ids {};

int main(int argc, char **argv)
{
    int num_procs, my_rank;
    ids.push_back(getpid());
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    if (my_rank == 0) {
        for (const int& i : ids) {
            std::cout << i << "\n";
        }
    }
    MPI_Finalize();
    return 0;
}
But because each MPI process is a separate program with its own copy of ids, this didn't work: the for loop only prints out the id of the process with rank 0.
To rephrase my question: is there any way in MPI to stop and resume processes?
UPDATE
Since I was asked in the comments why I need to do this, I will add more explanation here. I intend to implement a coordinated checkpointing approach. It works as follows. One process acts as a coordinator and, at a specific time, stops the other processes from executing the main program tasks. Each process then takes its checkpoint and informs the coordinator that checkpointing is done. When all processes have finished taking their checkpoints, the coordinator resumes the other processes so they carry on executing the main program tasks. The coordinator can be one of the processes that execute the main program, and at checkpointing time it also takes its own checkpoint.
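One way to at least collect the pids at a single coordinator rank is MPI_Gather. Below is a minimal sketch (assuming rank 0 plays the coordinator), with the caveat that signalling those pids with SIGSTOP/SIGCONT can only work when all ranks run on the same node, and that MPI itself has no portable stop/resume call:
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int num_procs, my_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    // every rank contributes its own pid; rank 0 (the would-be coordinator)
    // receives all of them in one vector
    int my_pid = (int)getpid();
    std::vector<int> pids(num_procs);
    MPI_Gather(&my_pid, 1, MPI_INT, pids.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        for (int p : pids)
            printf("collected pid %d\n", p);
        // rank 0 could now call kill(p, SIGSTOP) / kill(p, SIGCONT) on the
        // other pids, but only for ranks living on the same machine
    }

    MPI_Finalize();
    return 0;
}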

Limitation of data exchange using MPI C++

I am writing an MPI C++ code for data exchange; below is a sample:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;
    int dest, tag, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("SIZE = %d RANK = %d\n", size, rank);
    if (rank == 0)
    {
        double data[500];
        for (i = 0; i < 500; i++)
        {
            data[i] = i;
        }
        dest = 1;
        tag = 1;
        MPI_Send(data, 500, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return(0);
}
It looks like 500 is the maximum I can send. If the number of elements increases to 600, the code seems to stop at MPI_Send without making further progress. I suspect there is some limitation on the amount of data that can be transferred using MPI_Send. Can someone enlighten me?
Thanks in advance,
Kan
Long story short, your program is incorrect and you are lucky it did not hang with a small count.
Per the MPI standard, you cannot assume MPI_Send() will return unless a matching MPI_Recv() has been posted.
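In other words, the program needs a matching receive on rank 1. A minimal sketch of the corrected exchange (using 600 doubles just as an example):
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 600, tag = 1;
    double data[600];

    if (rank == 0)
    {
        for (int i = 0; i < n; i++)
            data[i] = i;
        MPI_Send(data, n, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Status status;
        /* the matching receive the original code is missing */
        MPI_Recv(data, n, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d doubles, last = %f\n", n, data[n - 1]);
    }

    MPI_Finalize();
    return(0);
}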
From a pragmatic point of view, "short" messages are generally sent in eager mode, and MPI_Send() likely returns immediately.
On the other hand, "long" messages usually involve a rendez-vous protocol, and hence hang until a matching receive has been posted.
What counts as "short" and "long" depends on several factors, including the interconnect you are using. And once again, you cannot assume MPI_Send() will always return immediately just because the messages you send are small enough.

Strange behavior when mixing OpenMP with OpenMPI

I have some code that is parallelized using OpenMP (on a for loop). I now want to repeat the functionality several times and use MPI to submit to a cluster of machines, keeping the intra-node work in OpenMP.
When I only use OpenMP, I get the speed-up I expect (using twice the number of processors/cores finishes in half the time). When I add in MPI and submit to only one MPI process, I do not get this speed-up. I created a toy problem to check this and still have the same issue. Here is the code:
#include <iostream>
#include <stdio.h>
#include <unistd.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[]) {
    int iam = 0, np = 1;
    long i;
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    double t1 = MPI_Wtime();
    std::cout << "!!!Hello World!!!" << std::endl; // prints !!!Hello World!!!

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    int nThread = omp_get_num_procs(); // omp_get_num_threads here returns 1??
    printf("nThread = %d\n", nThread);

    int *total = new int[nThread];
    for (int j = 0; j < nThread; j++) {
        total[j] = 0;
    }

    #pragma omp parallel num_threads(nThread) default(shared) private(iam, i)
    {
        np = omp_get_num_threads();
        #pragma omp for schedule(dynamic, 1)
        for (i = 0; i < 10000000; i++) {
            iam = omp_get_thread_num();
            total[iam]++;
        }
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    int grandTotal = 0;
    for (int j = 0; j < nThread; j++) {
        printf("Total=%d\n", total[j]);
        grandTotal += total[j];
    }
    printf("GrandTotal= %d\n", grandTotal);

    MPI_Finalize();
    double t2 = MPI_Wtime();
    printf("time elapsed with MPI clock=%f\n", t2 - t1);
    return 0;
}
I am compiling with openmpi-1.8/bin/mpic++, using the -fopenmp flag. Here is my PBS script
#PBS -l select=1:ncpus=12
setenv OMP_NUM_THREADS 12
/util/mpi/openmpi-1.8/bin/mpirun -np 1 -hostfile $PBS_NODEFILE --map-by node:pe=$OMP_NUM_THREADS /workspace/HelloWorldMPI/HelloWorldMPI
I have also tried with #PBS -l nodes=1:ppn=12 and get the same results.
When using half the cores, the program is actually faster (twice as fast!). When I reduce the number of cores, I change both ncpus and OMP_NUM_THREADS. I have tried increasing the actual work (adding 10^10 numbers instead of the 10^7 shown in the code). I have tried removing the printf statements, wondering if they were somehow slowing things down, and I still have the same problem. top shows that I am using all the CPUs (as set in ncpus) at close to 100%. If I submit with -np=2, it parallelizes beautifully on two machines, so MPI seems to be working as expected, but the OpenMP part appears to be broken.
I am out of ideas now as to what I can try. What am I doing wrong?
I hate to say it, but there's a lot wrong and you should probably just familiarize yourself more with OpenMP and MPI. Nevertheless, I'll try to go through your code and point out the errors I saw.
double t1 = MPI_Wtime();
Starting out: calling MPI_Wtime() before MPI_Init() is undefined. I'll also add that if you want to do this sort of benchmark with MPI, a good idea is to put an MPI_Barrier() before the call to MPI_Wtime() so that all the tasks enter the section at the same time.
//omp_get_num_threads here returns 1??
The reason omp_get_num_threads() returns 1 is that you are not yet in a parallel region when you call it.
#pragma omp parallel num_threads(nThread)
You set num_threads to nThread here, which, as Hristo Iliev mentioned, effectively ignores any input through the OMP_NUM_THREADS environment variable. You can usually just leave num_threads out and be fine for this sort of simplified problem.
default(shared)
Variables in the parallel region are shared by default, so there's no reason to have default(shared) here.
private(iam, i)
I guess it's your coding style, but instead of making iam and i private, you could just declare them within the parallel region, which automatically makes them private (and considering you don't really use them outside of it, there's not much reason not to).
#pragma omp for schedule(dynamic, 1)
Also, as Hristo Iliev mentioned, using schedule(dynamic, 1) for this particular problem is not the best idea, since each iteration of your loop takes virtually no time and the total problem size is fixed.
int grandTotal = 0;
for (int j = 0; j < nThread; j++) {
    printf("Total=%d\n", total[j]);
    grandTotal += total[j];
}
Not necessarily an error, but your allocation of the total array and the summation at the end are better accomplished using an OpenMP reduction clause.
double t2 = MPI_Wtime();
Similar to what you did before MPI_Init(), calling MPI_Wtime() after you've called MPI_Finalize() is undefined and should be avoided if possible.
Note: if you are somewhat familiar with what OpenMP is, this is a good reference, and basically everything I explained here about OpenMP is in there.
With that out of the way, I have to note that you didn't actually do anything with MPI besides output the rank and comm size. Which is to say, all the MPI tasks do a fixed amount of work each, regardless of the number of tasks. Since there's no decrease in work-per-task with an increasing number of MPI tasks, you wouldn't expect to have any scaling, would you? (Note: this is actually what's called weak scaling, but since you have no communication via MPI, there's no reason to expect it not to scale perfectly.)
Here's your code rewritten with some of the changes I mentioned:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_size,
        world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int name_len;
    char proc_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Get_processor_name(proc_name, &name_len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();

    // we need to scale the work per task by the number of MPI tasks,
    // otherwise we actually do more work the more tasks we have
    const int n_iterations = 1e7 / world_size;

    // actually we also need some dummy data to add so the compiler doesn't just
    // optimize out the work loop with -O3 on
    int data[16];
    for (int i = 0; i < 16; ++i)
        data[i] = rand() % 16;

    // reduction(+:total) means that all threads will make a private
    // copy of total at the beginning of this construct and then
    // do a reduction operation with the + operator at the end (aka sum them
    // all together)
    unsigned int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        // both of these calls will execute properly since we
        // are in an omp parallel region
        int n_threads = omp_get_num_threads(),
            thread_id = omp_get_thread_num();

        // note: this code will only execute on a single thread (per mpi task)
        #pragma omp master
        {
            printf("nThread = %d\n", n_threads);
        }

        #pragma omp for
        for (int i = 0; i < n_iterations; i++)
            total += data[i % 16];

        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               thread_id, n_threads, world_rank, world_size, proc_name);
    }

    // do a reduction with MPI, otherwise the data we just calculated is useless
    unsigned int grand_total;
    MPI_Allreduce(&total, &grand_total, 1, MPI_UNSIGNED, MPI_SUM, MPI_COMM_WORLD);

    // another barrier to make sure we wait for the slowest task
    MPI_Barrier(MPI_COMM_WORLD);
    double t_end = MPI_Wtime();

    // output each task's local total (summed over its threads)
    printf("Task total = %u\n", total);

    // output results from a single task
    if (world_rank == 0)
    {
        printf("Grand Total = %u\n", grand_total);
        printf("Time elapsed with MPI clock = %f\n", t_end - t_start);
    }

    MPI_Finalize();
    return 0;
}
Another thing to note, my version of your code executed 22 times slower with schedule(dynamic, 1) added, just to show you how it can impact performance when used incorrectly.
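For contrast, here is a sketch (my own, not from the code above) of a loop where schedule(dynamic) is usually the right choice, namely when the cost per iteration varies widely; the inner loop is just an artificial way of making later iterations more expensive:
#include <cmath>
#include <cstdio>

int main()
{
    const int n = 2000;
    static double results[2000];

    // iteration i costs roughly i*1000 inner steps, so a static split would
    // leave the threads that got the early iterations idle; dynamic chunks
    // of 16 let whichever thread finishes first grab the next chunk
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = 0; k < i * 1000; k++)
            s += std::sin(k * 1e-3);
        results[i] = s;
    }

    std::printf("results[%d] = %f\n", n - 1, results[n - 1]);
    return 0;
}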
Unfortunately I'm not too familiar with PBS, as the clusters I use run SLURM, but an example sbatch file for a job running on 3 nodes, on a system with two 6-core processors per node, might look something like this:
#!/bin/bash
#SBATCH --job-name=TestOrSomething
#SBATCH --export=ALL
#SBATCH --partition=debug
#SBATCH --nodes=3
#SBATCH --ntasks-per-socket=1
# use 6 OpenMP threads per MPI task here
export OMP_NUM_THREADS=6
# note that this will end up running 3 * (however many cpus
# are on a single node) mpi tasks, not just 3. Additionally
# the below line might use `mpirun` instead depending on the
# cluster
srun ./a.out
For fun, I also just ran my version on a cluster to test the scaling for MPI and OMP, and got the following (note the log scales):
As you can see, it's basically perfect. Actually, 1-16 is 1 MPI task with 1-16 OMP threads, and 16-256 is 1-16 MPI tasks with 16 threads per task, so you can also see that there's no change in behavior between the MPI scaling and the OMP scaling.

MPI: How to start three functions which will be executed in different threads

I have 3 functions and 4 cores. I want to execute each function in a new thread using MPI and C++.
I wrote this:
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
size--;
if (rank == 0)
{
    Thread1();
}
else
{
    if (rank == 1)
    {
        Thread2();
    }
    else
    {
        Thread3();
    }
}
MPI_Finalize();
But it executes just Thread1(). How must I change the code?
Thanks!
Print to screen the current value of the variable size (possibly without decrementing it) and you will find 1. That is, there is 1 process running.
You are likely running your compiled code the wrong way. Consider using mpirun (or mpiexec, depending on your MPI implementation) to execute it, i.e.
mpirun -np 4 ./MyCompiledCode
The -np parameter specifies the number of processes you will start (doing so, your MPI_Comm_size will be 4 as you expect).
Currently, though, you are not using anything specific to C++. You could consider a C++ binding of MPI such as Boost.MPI.
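For example, the rank/size boilerplate in Boost.MPI would look roughly like this (a sketch only; it has to be linked against the boost_mpi and boost_serialization libraries):
#include <iostream>
#include <boost/mpi.hpp>

namespace mpi = boost::mpi;

int main(int argc, char *argv[])
{
    mpi::environment env(argc, argv); // plays the role of MPI_Init/MPI_Finalize
    mpi::communicator world;          // wraps MPI_COMM_WORLD

    std::cout << "I am rank " << world.rank()
              << " of " << world.size() << std::endl;
    return 0;
}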
I worked a little on the code you provided and changed it to produce this working MPI code (I provided the needed corrections in capital letters).
FYI:
compilation (under gcc, mpich):
$ mpicxx -c mpi1.cpp
$ mpicxx -o mpi1 mpi1.o
execution
$ mpirun -np 4 ./mpi1
output
size is 4
size is 4
size is 4
2 function started.
thread2
3 function started.
thread3
3 function ended.
2 function ended.
size is 4
1 function started.
thread1
1 function ended.
Be aware that the stdout output of the different processes is likely interleaved.
Are you sure you are compiling your code the right way?
Your problem is that MPI provides no way to feed console input into every process, only into the process with rank 0. The trouble comes from the first lines of main:
int main(int argc, char *argv[]){
    int oper;
    std::cout << "Enter Size:";
    std::cin >> oper; // <------- The problem is right here
    Operations* operations = new Operations(oper);
    int rank, size;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    switch(tid)
    {
All processes but rank 0 block waiting for console input that they cannot get. You should rewrite the beginning of your main function as follows:
int main(int argc, char *argv[]){
    int oper;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    if (tid == 0) {
        std::cout << "Enter Size:";
        std::cin >> oper;
    }
    MPI_Bcast(&oper, 1, MPI_INT, 0, MPI_COMM_WORLD);
    Operations* operations = new Operations(oper);
    switch(tid)
    {
It works as follows: only rank 0 displays the prompt and reads the console input into oper. The value of oper is then broadcast from rank 0 so that all other processes obtain the correct value, create the Operations object and branch to the appropriate function.

Multi-Threaded MPI Process Suddenly Terminating

I'm writing an MPI program (Visual Studio 2k8 + MSMPI) that uses Boost::thread to spawn two threads per MPI process, and I have run into a problem I'm having trouble tracking down.
When I run the program with: mpiexec -n 2 program.exe, one of the processes suddenly terminates:
job aborted:
[ranks] message
[0] terminated
[1] process exited without calling finalize
---- error analysis -----
[1] on winblows
program.exe ended prematurely and may have crashed. exit code 0xc0000005
---- error analysis -----
I have no idea why the first process is suddenly terminating, and I can't figure out how to track down the reason. This happens even if I put the rank-zero process into an infinite loop at the end of all of its operations... it just suddenly dies. My main function looks like this:
int _tmain(int argc, _TCHAR* argv[])
{
    /* Initialize the MPI execution environment. */
    MPI_Init(0, NULL);

    /* Create the worker threads. */
    boost::thread masterThread(&Master);
    boost::thread slaveThread(&Slave);

    /* Wait for the local test thread to end. */
    masterThread.join();
    slaveThread.join();

    /* Shutdown. */
    MPI_Finalize();

    return 0;
}
The Master and Slave functions do some arbitrary work before ending. I can confirm that the master thread, at the very least, is reaching the end of its operations. The slave thread is always the one that isn't done before the execution gets aborted. Using print statements, it seems like the slave thread isn't actually hitting any errors... it's happily moving along and just gets taken out in the crash.
So, does anyone have any ideas for:
a) What could be causing this?
b) How should I go about debugging it?
Thanks so much!
Edit:
Posting minimal versions of the Master/Slave functions. Note that the goal of this program is purely for demonstration purposes... so it isn't doing anything useful. Essentially, the master thread sends a dummy payload to the slave thread of the other MPI process.
void Master()
{
    int myRank;
    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* Create a message with numbers 0 through 39 as the payload, addressed
     * to this thread. */
    int *payload = new int[40];
    for (int n = 0; n < 40; n++) {
        payload[n] = n;
    }

    if (myRank == 0) {
        MPI_Send(payload, 40, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD);
    } else {
        MPI_Send(payload, 40, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
    }

    /* Free memory. */
    delete(payload);
}
void Slave()
{
    MPI_Status status;
    int *payload = new int[40];
    MPI_Recv(payload, 40, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* Free memory. */
    delete(payload);
}
You have to use a thread-safe version of the MPI runtime.
Read up on MPI_Init_thread.
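A sketch of how the initialization could look, keeping the rest of the program as posted (plus <cstdio> for the fprintf); note that the thread level actually granted depends on the MPI implementation, so it should be checked:
int _tmain(int argc, _TCHAR* argv[])
{
    /* Ask for full thread support instead of plain MPI_Init. */
    int provided = 0;
    MPI_Init_thread(0, NULL, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* Concurrent MPI calls from several threads are not safe here. */
        fprintf(stderr, "MPI thread support too low: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Create and join the worker threads as before. */
    boost::thread masterThread(&Master);
    boost::thread slaveThread(&Slave);
    masterThread.join();
    slaveThread.join();

    MPI_Finalize();
    return 0;
}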