OpenMPI MPI_Send vs Intel MPI MPI_Send - c++

I have a code which I compile and run using Open MPI. Lately, I wanted to run this same code using Intel MPI, but it is not working as expected.
I dug into the code and found out that MPI_Send behaves differently in the two implementations.
I was advised on another forum to use MPI_Isend instead of MPI_Send, but that would require a lot of work to modify the code. Is there any workaround in Intel MPI to make it behave like Open MPI, perhaps some flag or a larger buffer? Thanks in advance for your answers.
int main(int argc, char **argv) {
    int numRanks;
    int rank;
    char cmd[] = "Hello world";
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numRanks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < numRanks; i++) {
            printf("Calling MPI_Send() from rank %d to %d\n", rank, i);
            MPI_Send(&cmd, sizeof(cmd), MPI_CHAR, i, MPI_TAG, MPI_COMM_WORLD);
            printf("Returned from MPI_Send()\n");
        }
    }

    MPI_Recv(&cmd, sizeof(cmd), MPI_CHAR, 0, MPI_TAG, MPI_COMM_WORLD, &status);
    printf("%d receieved from 0 %s\n", rank, cmd);

    MPI_Finalize();
}
OpenMPI Result
# mpirun --allow-run-as-root -n 2 helloworld_openmpi
Calling MPI_Send() from rank 0 to 0
Returned from MPI_Send()
Calling MPI_Send() from rank 0 to 1
Returned from MPI_Send()
0 receieved from 0 Hello world
1 receieved from 0 Hello world
Intel MPI Result
# mpiexec.hydra -n 2 /root/helloworld_intel
Calling MPI_Send() from rank 0 to 0
Stuck at MPI_Send.

It is incorrect to assume MPI_Send() will return before a matching receive is posted, so your code is incorrect with respect to the MPI Standard, and you are lucky it did not hang with Open MPI.
MPI implementations usually send small messages eagerly, so MPI_Send() can return immediately, but this is an implementation choice not mandated by the standard, and what counts as a "small" message depends on the library version, the interconnect you are using and other factors.
The only safe and portable choice here is to write correct code.
FWIW, MPI_Bcast(cmd, ...) is a better fit here, assuming all ranks already know the string length plus the NUL terminator.
Last but not least, the buffer argument is cmd and not &cmd.
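For illustration, a broadcast-based version could look roughly like the sketch below; the empty 12-byte buffer on non-root ranks is an assumption added so the broadcast actually transfers something:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    int rank;
    char cmd[12] = "";               /* all ranks know the buffer size in advance */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        strcpy(cmd, "Hello world");  /* only the root fills the buffer */

    /* every rank (root included) ends up with the same contents;
       no point-to-point matching is needed, so nothing can get stuck */
    MPI_Bcast(cmd, sizeof(cmd), MPI_CHAR, 0, MPI_COMM_WORLD);

    printf("%d received from 0 %s\n", rank, cmd);

    MPI_Finalize();
    return 0;
}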

Related

Limitation of data exchange using MPI c++

I am writing an MPI C++ code for data exchange; below is the sample code:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;
    int dest, tag, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("SIZE = %d RANK = %d\n", size, rank);

    if ( rank == 0 )
    {
        double data[500];
        for (i = 0; i < 500; i++)
        {
            data[i] = i;
        }
        dest = 1;
        tag = 1;
        MPI_Send(data, 500, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return(0);
}
It looks like 500 is the maximum that I can send. If the number of elements increases to 600, the code seems to stop at MPI_Send without further progress. I suspect there is some limitation on the data that can be transferred using MPI_Send. Can someone enlighten me?
Thanks in advance,
Kan
Long story short, your program is incorrect and you are lucky it did not hang with a small count.
Per the MPI standard, you cannot assume MPI_Send() will return unless a matching MPI_Recv() has been posted.
From a pragmatic point of view, "short" messages are generally sent in eager mode, and MPI_Send() likely returns immediately.
On the other hand, "long" messages usually involve a rendez-vous protocol, and hence hang until a matching receive has been posted.
What counts as "short" and "long" depends on several factors, including the interconnect you are using. And once again, you cannot assume MPI_Send() will always return immediately just because the messages are small enough.

Why does MPI_Recv fail when there is an accumulation of MPI_Send calls

I have an MPI program in which worker ranks (rank != 0) make a bunch of MPI_Send calls, and the master rank (rank == 0) receives all these messages. However, I run into a fatal error in MPI_Recv: "MPI_Recv(...) failed, Out of memory".
Here is the code that I am compiling in Visual Studio 2010.
I run the executable like so:
mpiexec -n 3 MPIHelloWorld.exe
int main(int argc, char* argv[]){
    int numprocs, rank, namelen, num_threads, thread_id;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    if(rank == 0){
        for(int k=1; k<numprocs; k++){
            for(int i=0; i<1000000; i++){
                double x;
                MPI_Recv(&x, 1, MPI_DOUBLE, k, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
    }
    else{
        for(int i=0; i<1000000; i++){
            double x = 5;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
        }
    }
}
If I run with only 2 processes, the program does not crash. So it seems like the problem is when there is an accumulation of the MPI_Send calls from a third rank (aka a second worker node).
If I decrease the number of iterations to 100,000 then I can run with 3 processes without crashing. However, the amount of data being sent with one million iterations is ~ 8 MB (8 bytes for double * 1000000 iterations), so I don't think the "Out of Memory" is referring to any physical memory like RAM.
Any insight is appreciated, thanks!
The MPI_Send operation stores the data in a system buffer, ready to send. The size of this buffer and where it is stored are implementation specific (I remember hearing that it can even live in the interconnect). In my case (Linux with MPICH) I don't get a memory error. One way to control this buffering explicitly is to use MPI_Buffer_attach together with MPI_Bsend. There may also be a way to change the system buffer size (e.g. the MP_BUFFER_MEM environment variable on IBM systems).
However, this situation of unmatched messages should probably not occur in practice. In your example above, the order of the k and i loops could be swapped to prevent this build-up of messages.
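As a rough sketch of the MPI_Buffer_attach / MPI_Bsend approach mentioned above (the buffer sizing is an assumption, and this mainly illustrates the API; whether it avoids the receiver-side error still depends on the implementation):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int numprocs, rank;
    const int niter = 1000000;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int k = 1; k < numprocs; k++)
            for (int i = 0; i < niter; i++) {
                double x;
                MPI_Recv(&x, 1, MPI_DOUBLE, k, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
    } else {
        /* attach a user buffer large enough for all outstanding messages:
           each MPI_Bsend needs the payload plus MPI_BSEND_OVERHEAD bytes */
        int bufsize = niter * ((int)sizeof(double) + MPI_BSEND_OVERHEAD);
        char *buf = (char *)malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        for (int i = 0; i < niter; i++) {
            double x = 5;
            MPI_Bsend(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
        }

        /* detach blocks until all buffered messages have been delivered */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}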

MPI code does not work with 2 nodes, but with 1

Super EDIT:
Adding the broadcast step results in ncols being printed by both processes on the master node (from which I can check the output). But why? All the variables that are broadcast already have a value on the line of their declaration!
I have some code based on this example.
I checked that cluster configuration is OK, with this simple program, which also printed the IP of the machine that it would run onto:
int main (int argc, char *argv[])
{
    int rank, size;

    MPI_Init (&argc, &argv);                /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */
    printf( "Hello world from process %d of %d\n", rank, size );
    // removed code that printed IP address
    MPI_Finalize();
    return 0;
}
which printed every machine's IP twice.
EDIT_2
If I print (only) the grid, as in the example, this is what I get for one computer:
Processes grid pattern:
0 1
2 3
and for two:
Processes grid pattern:
The executables were not the same on both nodes!
When I configured the cluster I had a very hard time, so when I ran into a problem with mounting I just skipped it. As a result, my changes appeared on only one node. The code would behave weirdly unless the executable on both nodes was the same.

MPI: How to start three functions which will be executed in different threads

I have 3 functions and 4 cores. I want to execute each function in a new thread using MPI and C++.
I write this
int rank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
size--;

if (rank == 0)
{
    Thread1();
}
else
{
    if (rank == 1)
    {
        Thread2();
    }
    else
    {
        Thread3();
    }
}

MPI_Finalize();
But it executes just Thread1(). How must I change the code?
Thanks!
Print the current value of the variable size (possibly without decrementing it) and you will find it is 1. That is: there is only 1 process running.
You are likely launching your compiled code the wrong way. Use mpirun (or mpiexec, depending on your MPI implementation) to execute it, i.e.
mpirun -np 4 ./MyCompiledCode
The -np parameter specifies the number of processes to start (doing so, your MPI_Comm_size will be 4, as you expect).
Currently, though, you are not using anything specific to C++. You could consider a C++ binding of MPI such as Boost.MPI.
I worked a little on the code you provided and changed it a bit, producing working MPI code along the lines sketched below.
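This is a minimal sketch of such a corrected program, reconstructed to match the build commands and output shown below; the Thread1/Thread2/Thread3 bodies and the rank-to-function mapping are assumptions rather than the exact original listing:

#include <mpi.h>
#include <stdio.h>

void Thread1() { printf("thread1\n"); }
void Thread2() { printf("thread2\n"); }
void Thread3() { printf("thread3\n"); }

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("size is %d\n", size);

    /* one process per function; rank 0 is left idle here */
    if (rank == 1) {
        printf("1 function started.\n");
        Thread1();
        printf("1 function ended.\n");
    } else if (rank == 2) {
        printf("2 function started.\n");
        Thread2();
        printf("2 function ended.\n");
    } else if (rank == 3) {
        printf("3 function started.\n");
        Thread3();
        printf("3 function ended.\n");
    }

    MPI_Finalize();
    return 0;
}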
FYI:
compilation (under gcc, mpich):
$ mpicxx -c mpi1.cpp
$ mpicxx -o mpi1 mpi1.o
execution
$ mpirun -np 4 ./mpi1
output
size is 4
size is 4
size is 4
2 function started.
thread2
3 function started.
thread3
3 function ended.
2 function ended.
size is 4
1 function started.
thread1
1 function ended.
Be aware that the stdout output from different ranks is likely interleaved.
Are you sure you are compiling your code the right way?
Your problem is that MPI provides no way to feed console input to all processes; it goes only to the process with rank 0. Because of the first three lines in main:
int main(int argc, char *argv[]){
    int oper;
    std::cout << "Enter Size:";
    std::cin >> oper;                       // <------- The problem is right here
    Operations* operations = new Operations(oper);
    int rank, size;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    switch(tid)
    {
all processes except rank 0 block waiting for console input that never arrives. You should rewrite the beginning of your main function as follows:
int main(int argc, char *argv[]){
    int oper;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    if (tid == 0) {
        std::cout << "Enter Size:";
        std::cin >> oper;
    }
    MPI_Bcast(&oper, 1, MPI_INT, 0, MPI_COMM_WORLD);
    Operations* operations = new Operations(oper);
    switch(tid)
    {
It works as follows: only rank 0 displays the prompt and then reads the console input into oper. Then a broadcast of the value of oper from rank 0 is performed so all other processes obtain the correct value, create the Operations object and then branch to the appropriate function.

MPI_Barrier doesn't function properly

I wrote the C application below to help me understand MPI and why MPI_Barrier() isn't functioning in my huge C++ application; I was able to reproduce the problem from the huge application with this much smaller C program. Essentially, I call MPI_Barrier() inside a for loop, and MPI_Barrier() is visible to all nodes, yet after 2 iterations of the loop the program becomes deadlocked. Any thoughts?
#include <mpi.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int i = 0, numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("%s: Rank %d of %d\n", processor_name, rank, numprocs);

    for (i = 1; i <= 100; i++) {
        if (rank == 0) printf("Before barrier (%d:%s)\n", i, processor_name);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("After barrier (%d:%s)\n", i, processor_name);
    }

    MPI_Finalize();
    return 0;
}
The output:
alienone: Rank 1 of 4
alienfive: Rank 3 of 4
alienfour: Rank 2 of 4
alientwo: Rank 0 of 4
Before barrier (1:alientwo)
After barrier (1:alientwo)
Before barrier (2:alientwo)
After barrier (2:alientwo)
Before barrier (3:alientwo)
I am using GCC 4.4, Open MPI 1.3 from the Ubuntu 10.10 repositories.
Also, in my huge C++ application, MPI Broadcasts don't work. Only half the nodes receive the broadcast, the others are stuck waiting for it.
Thank you in advance for any help or insights!
Update: Upgraded to Open MPI 1.4.4, compiled from source into /usr/local/.
Update: Attaching GDB to the running process yields an interesting result. It seems all nodes have died at the MPI barrier, but MPI still thinks they are running:
0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
83 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0 0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x00007fc236a45141 in poll_dispatch () from /usr/local/lib/libopen-pal.so.0
#2 0x00007fc236a43f89 in opal_event_base_loop () from /usr/local/lib/libopen-pal.so.0
#3 0x00007fc236a38119 in opal_progress () from /usr/local/lib/libopen-pal.so.0
#4 0x00007fc236eff525 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.0
#5 0x00007fc23141ad76 in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
#6 0x00007fc2314247ce in ompi_coll_tuned_barrier_intra_recursivedoubling () from /usr/local/lib/openmpi/mca_coll_tuned.so
#7 0x00007fc236f15f12 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
#8 0x0000000000400b32 in main (argc=1, argv=0x7fff5883da58) at barrier_test.c:14
(gdb)
Update:
I also have this code:
#include <mpi.h>
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[] ) {
    int n = 400, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("MPI Rank %i of %i.\n", myid, numprocs);

    while (1) {
        h   = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }

    MPI_Finalize();
    return 0;
}
And despite the infinite loop, there is only one output from the printf() in the loop:
mpirun -n 24 --machinefile /etc/machines a.out
MPI Rank 0 of 24.
MPI Rank 3 of 24.
MPI Rank 1 of 24.
MPI Rank 4 of 24.
MPI Rank 17 of 24.
MPI Rank 15 of 24.
MPI Rank 5 of 24.
MPI Rank 7 of 24.
MPI Rank 16 of 24.
MPI Rank 2 of 24.
MPI Rank 11 of 24.
MPI Rank 9 of 24.
MPI Rank 8 of 24.
MPI Rank 20 of 24.
MPI Rank 23 of 24.
MPI Rank 19 of 24.
MPI Rank 12 of 24.
MPI Rank 13 of 24.
MPI Rank 21 of 24.
MPI Rank 6 of 24.
MPI Rank 10 of 24.
MPI Rank 18 of 24.
MPI Rank 22 of 24.
MPI Rank 14 of 24.
pi is approximately 3.1415931744231269, Error is 0.0000005208333338
Any thoughts?
MPI_Barrier() in Open MPI sometimes hangs when processes reach the barrier at very different times since the previous barrier; however, that does not appear to be your case. Anyway, try using MPI_Reduce() instead of, or before, the real call to MPI_Barrier(). This is not a direct equivalent of a barrier, but any synchronous call with almost no payload involving all processes in a communicator should work like a barrier (see the sketch below). I haven't seen such behaviour of MPI_Barrier() in LAM/MPI, MPICH2 or even WMPI, but it was a real issue with Open MPI.
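A pseudo-barrier built from a tiny reduction could look like this minimal sketch (using MPI_Allreduce so every rank waits for the combined result; it is not a drop-in replacement for MPI_Barrier):

#include <mpi.h>

/* every rank contributes one int and waits for the combined result,
   so no rank can leave before all ranks have entered */
static void pseudo_barrier(MPI_Comm comm)
{
    int in = 1, out = 0;
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, comm);
}

Calling pseudo_barrier(MPI_COMM_WORLD) in the loop instead of MPI_Barrier(MPI_COMM_WORLD) would at least tell you whether the hang is specific to the barrier algorithm.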
What interconnect do you have? Is it a specialised one like InfiniBand or Myrinet, or are you just using plain TCP over Ethernet? Do you have more than one configured network interface when running with the TCP transport?
Besides, Open MPI is modular -- there are many modules that provide algorithms implementing the various collective operations. You can try to fiddle with them using MCA parameters, e.g. you can start debugging your application's behaviour by increasing the verbosity of the btl component, passing mpirun something like --mca btl_base_verbose 30. Look for something similar to:
[node1:19454] btl: tcp: attempting to connect() to address 192.168.2.2 on port 260
[node2:29800] btl: tcp: attempting to connect() to address 192.168.2.1 on port 260
[node1:19454] btl: tcp: attempting to connect() to address 192.168.109.1 on port 260
[node1][[54886,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.109.1 failed: Connection timed out (110)
In that case some (or all) nodes have more than one configured network interface that is up, but not all nodes are reachable through all of the interfaces. This might happen, e.g., if the nodes run a recent Linux distro with Xen support enabled by default (RHEL?) or have other virtualisation software installed that brings up virtual network interfaces.
By default Open MPI is lazy, that is, connections are opened on demand. The first send/receive communication may succeed if the right interface is picked up, but subsequent operations are likely to pick one of the alternate paths in order to maximise the bandwidth. If the other node is unreachable through the second interface, a timeout is likely to occur and the communication will fail, as Open MPI will consider the other node down or problematic.
The solution is to isolate the non-connecting networks or network interfaces using MCA parameters of the TCP btl module:
force Open MPI to use only a specific IP network for communication: --mca btl_tcp_if_include 192.168.2.0/24
force Open MPI to use only some of the network interfaces that are known to provide full network connectivity: --mca btl_tcp_if_include eth0,eth1
force Open MPI to not use network interfaces that are known to be private/virtual or to belong to other networks that do not connect the nodes (if you choose to do so, you must exclude the loopback lo): --mca btl_tcp_if_exclude lo,virt0
Refer to the Open MPI run-time TCP tuning FAQ for more details.
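For instance, a complete launch line combining one of these options with the question's invocation might look as follows (the interface name eth0 is a placeholder, not a value taken from the question):

mpirun --mca btl_tcp_if_include eth0 -n 24 --machinefile /etc/machines ./a.out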