MPI code does not work with 2 nodes, but with 1 - c++

Super EDIT:
Adding the broadcast step results in ncols getting printed by the two processes on the master node (from which I can check the output). But why? I mean, all the variables that are broadcast already have a value on the line of their declaration!
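To be clear, by "the broadcast step" I mean something along these lines (a minimal standalone sketch; the value ncols takes on the master is hypothetical):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, ncols = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        ncols = 4;                                        /* hypothetical value set on the master */
    MPI_Bcast(&ncols, 1, MPI_INT, 0, MPI_COMM_WORLD);     /* every rank now holds ncols */
    printf("rank %d sees ncols = %d\n", rank, ncols);
    MPI_Finalize();
    return 0;
}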
I have some code based on this example.
I checked that the cluster configuration is OK with this simple program, which also printed the IP of the machine it runs on:
#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    int rank, size;
    MPI_Init (&argc, &argv);                  /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size);    /* get number of processes */
    printf( "Hello world from process %d of %d\n", rank, size );
    // removed code that printed IP address
    MPI_Finalize();
    return 0;
}
which printed every machine's IP twice.
EDIT_2
If I print (only) the grid, as in the example, I get for one computer:
Processes grid pattern:
0 1
2 3
and for two:
Processes grid pattern:

Executables were not the same in both nodes!
When I configured the cluster I had a very hard time, so when I hit a problem with mounting I just skipped it. As a result, changes would appear on only one node, and the code would behave weirdly unless some (or all) of the code was the same on both nodes.

Related

OpenMPI MPI_Send vs Intel MPI MPI_Send

I have a code which I compile and run using OpenMPI. Lately, I wanted to run this same code using Intel MPI, but the code is not working as expected.
I dug into the code and found out that MPI_Send behaves differently in the two implementations.
I got the advice on a different forum to use MPI_Isend instead of MPI_Send, but that requires a lot of work to modify the code. Is there any workaround in Intel MPI to make it work just like in OpenMPI? Maybe some flags, increasing a buffer, or something else. Thanks in advance for your answers.
#include <mpi.h>
#include <stdio.h>

#define MPI_TAG 0   /* tag value assumed here; the definition was not shown in the original snippet */

int main(int argc, char **argv) {
    int numRanks;
    int rank;
    char cmd[] = "Hello world";
    MPI_Status status;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &numRanks);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < numRanks; i++) {
            printf("Calling MPI_Send() from rank %d to %d\n", rank, i);
            MPI_Send(&cmd, sizeof(cmd), MPI_CHAR, i, MPI_TAG, MPI_COMM_WORLD);
            printf("Returned from MPI_Send()\n");
        }
    }

    MPI_Recv(&cmd, sizeof(cmd), MPI_CHAR, 0, MPI_TAG, MPI_COMM_WORLD, &status);
    printf("%d received from 0 %s\n", rank, cmd);

    MPI_Finalize();
    return 0;
}
OpenMPI Result
# mpirun --allow-run-as-root -n 2 helloworld_openmpi
Calling MPI_Send() from rank 0 to 0
Returned from MPI_Send()
Calling MPI_Send() from rank 0 to 1
Returned from MPI_Send()
0 received from 0 Hello world
1 received from 0 Hello world
Intel MPI Result
# mpiexec.hydra -n 2 /root/helloworld_intel
Calling MPI_Send() from rank 0 to 0
Stuck at MPI_Send.
It is incorrect to assume MPI_Send() will return before a matching receive is posted, so your code is incorrect with respect to the MPI Standard, and you are lucky it did not hang with Open MPI.
MPI implementations usually send small messages eagerly, so MPI_Send() can return immediately, but this is an implementation choice not mandated by the standard, and what counts as a "small" message depends on the library version, the interconnect you are using, and other factors.
The only safe and portable choice here is to write correct code.
FWIW, MPI_Bcast(cmd, ...) is a better fit here, assuming all ranks already know the string length plus the NUL terminator.
Last but not least, the buffer argument is cmd and not &cmd.
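A minimal sketch of what the broadcast-based version could look like (assuming every rank knows the buffer size at compile time, including the NUL terminator):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    char cmd[] = "Hello world";   /* every rank knows the length at compile time */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0)
        cmd[0] = '\0';            /* make it obvious that the data really comes from rank 0 */

    /* rank 0 broadcasts the string (including the NUL terminator) to all ranks */
    MPI_Bcast(cmd, sizeof(cmd), MPI_CHAR, 0, MPI_COMM_WORLD);

    printf("%d received from 0 %s\n", rank, cmd);

    MPI_Finalize();
    return 0;
}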

MPI_Comm_Spawn called multiple times

We are writing a code to solve a non-linear problem using an iterative method (Newton). Anyway, the problem is that we don't know a priori how many MPI processes will be needed from one iteration to another, due to e.g. remeshing, adaptivity, etc., and there are quite a lot of iterations...
We would hence like to use MPI_Comm_spawn at each iteration to create as many MPI processes as we need, gather the results, and "destroy" the subprocesses. We know this limits the scalability of the code due to the gathering of information; however, we have been asked to do it :)
I did a couple of tests of MPI_Comm_spawn on my laptop (on Windows 7/64-bit) using Intel MPI and Visual Studio Express 2013. I tried these simple codes:
//StackMain
#include <iostream>
#include <mpi.h>
#include<vector>
int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    for (int i = 0; i < 10000; i++)
    {
        std::cout << "Loop number " << i << std::endl;
        MPI_Comm children;
        std::vector<int> err(4);
        ierr = MPI_Comm_spawn("StackWorkers.exe", NULL, 4, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &children, &err[0]);
        MPI_Barrier(children);
        MPI_Comm_disconnect(&children);
    }
    ierr = MPI_Finalize();
    return 0;
}
And the program launched by the spawned processes:
//StackWorkers
#include <mpi.h>
int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    MPI_Comm parent;
    ierr = MPI_Comm_get_parent(&parent);
    MPI_Barrier(parent);
    ierr = MPI_Finalize();
    return 0;
}
The program is launched using one MPI process:
mpiexec -np 1 StackMain.exe
It seems to work, I do have however some questions...
1- The program freezes during iteration 4096, and this number does not change if I relaunch the program. If during each iteration I spawn 4 processes twice, then it stops at iteration 2048...
Is it a limitation of the operating system?
2- When I look at the memory occupied by "mpiexec" during the program, it grows continuously (never going down). Do you know why? I thought that when the subprocesses finished their job they would release the memory they used...
3- Should I disconnect/free the children communicator or not? If yes, must MPI_Comm_disconnect(...) be called on both the spawning and the spawned processes, or only on the spawned ones?
Thanks a lot!

Simple MPI_Scatter try

I am just learning OpenMPI. Tried a simple MPI_Scatter example:
#include <mpi.h>
#include <iostream>
using namespace std;

int main() {
    int numProcs, rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* data;
    int num;

    data = new int[5];
    data[0] = 0;
    data[1] = 1;
    data[2] = 2;
    data[3] = 3;
    data[4] = 4;

    MPI_Scatter(data, 5, MPI_INT, &num, 5, MPI_INT, 0, MPI_COMM_WORLD);
    cout << rank << " received " << num << endl;

    MPI_Finalize();
    return 0;
}
But it didn't work as expected ...
I was expecting something like
0 received 0
1 received 1
2 received 2 ...
But what I got was
32609 received
1761637486 received
1 received
33 received
1601007716 received
What's with the weird ranks? It seems to be something to do with my scatter. Also, why are the sendcount and recvcount the same? At first I thought that since I'm scattering 5 elements to 5 processors, each will get 1, so I should be using:
MPI_Scatter(data, 5, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
But this gives an error:
[JM:2861] *** An error occurred in MPI_Scatter
[JM:2861] *** on communicator MPI_COMM_WORLD
[JM:2861] *** MPI_ERR_TRUNCATE: message truncated
[JM:2861] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing is, will other processes run scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs:
if (rank == xxx) {
UPDATE
I noticed that, to make it run, the send and receive buffers must be of the same length... and the data should be declared like:
int data[5][5] = { {0}, {5}, {10}, {3}, {4} };
Notice that each row is declared with length 5 but I only initialized 1 value per row? What is actually happening here? Is this code correct? Suppose I only want each process to receive 1 value.
sendcount is the number of elements you want to send to each process, not the count of elements in the send buffer. MPI_Scatter will just take sendcount * [number of processes in the communicator] elements from the send buffer of the root process and scatter them to all processes in the communicator.
So to send 1 element to each process in the communicator (assume there are 5 processes), set both sendcount and recvcount to 1.
MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
There are restrictions on the possible datatype pairs and they are the same as for point-to-point operations. The type map of recvtype should be compatible with the type map of sendtype, i.e. they should have the same list of underlying basic datatypes. Also the receive buffer should be large enough to hold the received message (it might be larger, but not smaller). In most simple cases, the data type on both send and receive sides are the same. So sendcount - recvcount pair and sendtype - recvtype pair usually end up the same. An example where they can differ is when one uses user-defined datatype(s) on either side:
MPI_Datatype vec5int;
MPI_Type_contiguous(5, MPI_INT, &vec5int);
MPI_Type_commit(&vec5int);
MPI_Scatter(data, 5, MPI_INT, local_data, 1, vec5int, 0, MPI_COMM_WORLD);
This works since the sender constructs messages of 5 elements of type MPI_INT while each receiver interprets the message as a single instance of a 5-element integer vector.
(Note that you specify the maximum number of elements to be received in MPI_Recv and the actual amount received might be less, which can be obtained by MPI_Get_count. In contrast, you supply the expected number of elements to be received in recvcount of MPI_Scatter so error will be thrown if the message length received is not exactly the same as promised.)
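For illustration, a small sketch of how MPI_Recv and MPI_Get_count interact (run with two processes; the counts here are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf[10] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(buf, 3, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* send only 3 ints */
    } else if (rank == 1) {
        MPI_Status status;
        int count;
        /* the receive allows up to 10 ints, but the actual message is shorter */
        MPI_Recv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        printf("actually received %d ints\n", count);        /* prints 3 */
    }

    MPI_Finalize();
    return 0;
}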
Probably you know by now that the weird ranks printed out are caused by stack corruption, since num can only contain 1 int but 5 ints are received in MPI_Scatter.
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing is, will other processes run scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs.
It is necessary to differentiate between the root and the other processes in the communicator (they are not child processes of the root, since they can be on separate computers) in some operations such as Scatter and Gather, since these are collective (group) communications but with a single source/destination. The single source/destination (the odd one out) is therefore called the root. It is necessary for all the processes to know the source/destination (the root process) to set up the send and receive correctly.
The root process, in case of Scatter, will also receive a piece of data (from itself), and in case of Gather, will also include its data in the final result. There is no exception for the root process, unless "in place" operations are used. This also applies to all collective communication functions.
There are also root-less global communication operations like MPI_Allgather, where one does not provide a root rank. Rather all ranks receive the data being gathered.
All processes in the communicator will run the function (try to exclude one process in the communicator and you will get a deadlock). You can imagine processes on different computers running the same code blindly. However, since each of them may belong to a different communicator group and has a different rank, the function will run differently. Each process knows whether it is a member of the communicator, and each knows its own rank and can compare it to the rank of the root process (if any), so they can set up the communication or take extra actions accordingly.
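Putting it all together, a minimal corrected sketch of the original program (assuming it is run with 5 processes) could be:

#include <mpi.h>
#include <iostream>
using namespace std;

int main() {
    int numProcs, rank;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[5] = {0, 1, 2, 3, 4};   // only the root's send buffer is actually read
    int num = -1;

    // send 1 int to each rank and receive 1 int on each rank
    MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);

    cout << rank << " received " << num << endl;

    MPI_Finalize();
    return 0;
}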

MPI: How to start three functions which will be executed in different threads

I have 3 functions and 4 cores. I want to execute each function in a new thread using MPI and C++.
I wrote this:
int rank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
size--;

if (rank == 0)
{
    Thread1();
}
else
{
    if (rank == 1)
    {
        Thread2();
    }
    else
    {
        Thread3();
    }
}

MPI_Finalize();
But it executes just Thread1(). How must I change the code?
Thanks!
Print to screen the current value of the variable size (possibly without decrementing it) and you will find it is 1. That is: there is only 1 process running.
You are likely running your compiled code the wrong way. Consider using mpirun (or mpiexec, depending on your MPI implementation) to execute it, i.e.
mpirun -np 4 ./MyCompiledCode
The -np parameter specifies the number of processes to start (this way, your MPI_Comm_size will be 4, as you expect).
Currently, though, you are not using anything specific to C++. You could consider a C++ binding of MPI such as Boost.MPI.
I worked a little bit on the code you provided and changed it slightly, producing a working MPI code (I made some needed corrections, marked in capital letters).
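(The modified code itself is not reproduced here; a minimal sketch consistent with the output below, assuming each ThreadN() simply prints its name, could look like this:)

#include <cstdio>
#include <mpi.h>

void Thread1() { std::printf("thread1\n"); }
void Thread2() { std::printf("thread2\n"); }
void Thread3() { std::printf("thread3\n"); }

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("size is %d\n", size);

    // dispatch one function per rank; any extra rank does nothing
    if (rank == 0) {
        std::printf("1 function started.\n");
        Thread1();
        std::printf("1 function ended.\n");
    } else if (rank == 1) {
        std::printf("2 function started.\n");
        Thread2();
        std::printf("2 function ended.\n");
    } else if (rank == 2) {
        std::printf("3 function started.\n");
        Thread3();
        std::printf("3 function ended.\n");
    }

    MPI_Finalize();
    return 0;
}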
FYI:
compilation (under gcc, mpich):
$ mpicxx -c mpi1.cpp
$ mpicxx -o mpi1 mpi1.o
execution
$ mpirun -np 4 ./mpi1
output
size is 4
size is 4
size is 4
2 function started.
thread2
3 function started.
thread3
3 function ended.
2 function ended.
size is 4
1 function started.
thread1
1 function ended.
Be aware that stdout output from the different processes is likely to be interleaved.
Are you sure you are compiling your code the right way?
Your problem is that MPI provides no way to feed console input into every process; input goes only to the process with rank 0. This is because of the first three lines in main:
int main(int argc, char *argv[]){
    int oper;
    std::cout << "Enter Size:";
    std::cin >> oper;    // <------- The problem is right here

    Operations* operations = new Operations(oper);

    int rank, size;
    MPI_Init(&argc, &argv);

    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);

    switch(tid)
    {
all processes except rank 0 block waiting for console input which they cannot get. You should rewrite the beginning of your main function as follows:
int main(int argc, char *argv[]){
    int oper;

    MPI_Init(&argc, &argv);

    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);

    if (tid == 0) {
        std::cout << "Enter Size:";
        std::cin >> oper;
    }
    MPI_Bcast(&oper, 1, MPI_INT, 0, MPI_COMM_WORLD);

    Operations* operations = new Operations(oper);

    switch(tid)
    {
It works as follows: only rank 0 displays the prompt and reads the console input into oper. Then the value of oper is broadcast from rank 0 so that all other processes obtain the correct value, create the Operations object, and branch to the appropriate function.

Multi-Threaded MPI Process Suddenly Terminating

I'm writing an MPI program (Visual Studio 2k8 + MSMPI) that uses Boost::thread to spawn two threads per MPI process, and have run into a problem I'm having trouble tracking down.
When I run the program with: mpiexec -n 2 program.exe, one of the processes suddenly terminates:
job aborted:
[ranks] message
[0] terminated
[1] process exited without calling finalize
---- error analysis -----
[1] on winblows
program.exe ended prematurely and may have crashed. exit code 0xc0000005
---- error analysis -----
I have no idea why the first process is suddenly terminating, and can't figure out how to track down the reason. This happens even if I put the rank-zero process into an infinite loop at the end of all of its operations... it just suddenly dies. My main function looks like this:
int _tmain(int argc, _TCHAR* argv[])
{
    /* Initialize the MPI execution environment. */
    MPI_Init(0, NULL);

    /* Create the worker threads. */
    boost::thread masterThread(&Master);
    boost::thread slaveThread(&Slave);

    /* Wait for the local test thread to end. */
    masterThread.join();
    slaveThread.join();

    /* Shutdown. */
    MPI_Finalize();
    return 0;
}
Where the master and slave functions do some arbitrary work before ending. I can confirm that the master thread, at the very least, is reaching the end of its operations. The slave thread is always the one that isn't done before the execution gets aborted. Using print statements, it seems like the slave thread isn't actually hitting any errors... it's happily moving along and just gets taken out in the crash.
So, does anyone have any ideas for:
a) What could be causing this?
b) How should I go about debugging it?
Thanks so much!
Edit:
Posting minimal versions of the Master/Slave functions. Note that this program is purely for demonstration purposes... so it isn't doing anything useful. Essentially, the master threads send a dummy payload to the slave thread of the other MPI process.
void Master()
{
    int myRank;
    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* Create a message with numbers 0 through 39 as the payload, addressed
     * to this thread. */
    int *payload = new int[40];
    for (int n = 0; n < 40; n++) {
        payload[n] = n;
    }

    if (myRank == 0) {
        MPI_Send(payload, 40, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD);
    } else {
        MPI_Send(payload, 40, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
    }

    /* Free memory. */
    delete(payload);
}

void Slave()
{
    MPI_Status status;
    int *payload = new int[40];
    MPI_Recv(payload, 40, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* Free memory. */
    delete(payload);
}
You have to use a thread-safe version of the MPI runtime.
Read up on MPI_Init_thread.
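A minimal sketch of requesting a thread-safe runtime (the rest of the program would stay as it is; MPI_THREAD_MULTIPLE is needed because both threads make MPI calls concurrently):

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    int provided;

    /* Ask for full thread support instead of plain MPI_Init. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI library does not provide MPI_THREAD_MULTIPLE\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... create and join the Master/Slave threads exactly as before ... */

    MPI_Finalize();
    return 0;
}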