MPI_Barrier doesn't function properly

I wrote the C application below to help me understand MPI and to find out why MPI_Barrier() isn't working in my large C++ application. It reproduces the problem from the large application in a much smaller program. Essentially, I call MPI_Barrier() inside a for loop, and MPI_Barrier() is visible to all nodes, yet after 2 iterations of the loop the program deadlocks. Any thoughts?
#include <mpi.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int i=0, numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("%s: Rank %d of %d\n", processor_name, rank, numprocs);
    for(i=1; i <= 100; i++) {
        if (rank==0) printf("Before barrier (%d:%s)\n",i,processor_name);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank==0) printf("After barrier (%d:%s)\n",i,processor_name);
    }
    MPI_Finalize();
    return 0;
}
The output:
alienone: Rank 1 of 4
alienfive: Rank 3 of 4
alienfour: Rank 2 of 4
alientwo: Rank 0 of 4
Before barrier (1:alientwo)
After barrier (1:alientwo)
Before barrier (2:alientwo)
After barrier (2:alientwo)
Before barrier (3:alientwo)
I am using GCC 4.4, Open MPI 1.3 from the Ubuntu 10.10 repositories.
Also, in my huge C++ application, MPI Broadcasts don't work. Only half the nodes receive the broadcast, the others are stuck waiting for it.
Thank you in advance for any help or insights!
Update: Upgraded to Open MPI 1.4.4, compiled from source into /usr/local/.
Update: Attaching GDB to the running process shows an interesting result. It looks to me as if all nodes have died at the MPI barrier, but MPI still thinks the program is running:
0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
83 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0 0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x00007fc236a45141 in poll_dispatch () from /usr/local/lib/libopen-pal.so.0
#2 0x00007fc236a43f89 in opal_event_base_loop () from /usr/local/lib/libopen-pal.so.0
#3 0x00007fc236a38119 in opal_progress () from /usr/local/lib/libopen-pal.so.0
#4 0x00007fc236eff525 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.0
#5 0x00007fc23141ad76 in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
#6 0x00007fc2314247ce in ompi_coll_tuned_barrier_intra_recursivedoubling () from /usr/local/lib/openmpi/mca_coll_tuned.so
#7 0x00007fc236f15f12 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
#8 0x0000000000400b32 in main (argc=1, argv=0x7fff5883da58) at barrier_test.c:14
(gdb)
Update:
I also have this code:
#include <mpi.h>
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[] ) {
    int n = 400, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    printf("MPI Rank %i of %i.\n", myid, numprocs);
    while (1) {
        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
And despite the infinite loop, there is only one output from the printf() in the loop:
mpirun -n 24 --machinefile /etc/machines a.out
MPI Rank 0 of 24.
MPI Rank 3 of 24.
MPI Rank 1 of 24.
MPI Rank 4 of 24.
MPI Rank 17 of 24.
MPI Rank 15 of 24.
MPI Rank 5 of 24.
MPI Rank 7 of 24.
MPI Rank 16 of 24.
MPI Rank 2 of 24.
MPI Rank 11 of 24.
MPI Rank 9 of 24.
MPI Rank 8 of 24.
MPI Rank 20 of 24.
MPI Rank 23 of 24.
MPI Rank 19 of 24.
MPI Rank 12 of 24.
MPI Rank 13 of 24.
MPI Rank 21 of 24.
MPI Rank 6 of 24.
MPI Rank 10 of 24.
MPI Rank 18 of 24.
MPI Rank 22 of 24.
MPI Rank 14 of 24.
pi is approximately 3.1415931744231269, Error is 0.0000005208333338
Any thoughts?

MPI_Barrier() in Open MPI sometimes hangs when processes reach the barrier after very different amounts of time have passed since the last barrier; however, as far as I can see, that's not your case. Anyway, try using MPI_Reduce() instead of, or just before, the real call to MPI_Barrier(). This is not a direct equivalent of a barrier, but in practice any synchronous call with almost no payload that involves all processes in a communicator should work much like one. I haven't seen such behavior of MPI_Barrier() in LAM/MPI, MPICH2, or even WMPI, but it was a real issue with Open MPI.

What interconnect do you have? Is it a specialised one like InfiniBand or Myrinet, or are you just using plain TCP over Ethernet? If you are running with the TCP transport, do you have more than one configured network interface?
Besides, Open MPI is modular: there are many modules that provide algorithms implementing the various collective operations. You can try to fiddle with them using MCA parameters. For example, you can start debugging your application's behaviour by increasing the verbosity of the btl component, passing mpirun something like --mca btl_base_verbose 30. Look for something similar to:
[node1:19454] btl: tcp: attempting to connect() to address 192.168.2.2 on port 260
[node2:29800] btl: tcp: attempting to connect() to address 192.168.2.1 on port 260
[node1:19454] btl: tcp: attempting to connect() to address 192.168.109.1 on port 260
[node1][[54886,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.109.1 failed: Connection timed out (110)
In that case some (or all) nodes have more than one configured network interface that is up, but not all nodes are reachable through all the interfaces. This might happen, for example, if the nodes run a recent Linux distro with Xen support enabled by default (RHEL?) or have other virtualisation software installed that brings up virtual network interfaces.
By default Open MPI is lazy, that is, connections are opened on demand. The first send/receive communication may succeed if the right interface is picked, but subsequent operations are likely to pick one of the alternate paths in order to maximise the bandwidth. If the other node is unreachable through the second interface, a timeout is likely to occur and the communication will fail, as Open MPI will consider the other node down or problematic.
The solution is to isolate the non-connecting networks or network interfaces using MCA parameters of the TCP btl module:
force Open MPI to use only a specific IP network for communication: --mca btl_tcp_if_include 192.168.2.0/24
force Open MPI to use only some of the network interfaces that are known to provide full network connectivity: --mca btl_tcp_if_include eth0,eth1
force Open MPI to not use network interfaces that are known to be private/virtual or to belong to other networks that do not connect the nodes (if you choose to do so, you must exclude the loopback lo): --mca btl_tcp_if_exclude lo,virt0
Refer to the Open MPI run-time TCP tuning FAQ for more details.
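What a value such as 192.168.2.0/24 selects can be checked with a little address arithmetic. The sketch below only illustrates CIDR matching; it is not how Open MPI chooses interfaces internally, and parse_ipv4 and in_network are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Parse a dotted-quad IPv4 address into a host-order 32-bit integer.
 * Returns 0 on success, -1 on malformed input. */
int parse_ipv4(const char *text, uint32_t *out)
{
    unsigned a, b, c, d;
    char extra;
    assert(text != NULL && out != NULL);
    /* The trailing %c catches garbage after the fourth octet. */
    if (sscanf(text, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;
    *out = ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | (uint32_t)d;
    return 0;
}

/* Return 1 if addr lies inside network/prefix_len, 0 if it does not,
 * and -1 if any argument is malformed. */
int in_network(const char *addr, const char *network, int prefix_len)
{
    uint32_t a, n, mask;
    if (prefix_len < 0 || prefix_len > 32)
        return -1;
    if (parse_ipv4(addr, &a) != 0 || parse_ipv4(network, &n) != 0)
        return -1;
    /* A shift by 32 is undefined in C, so /0 is handled separately. */
    mask = (prefix_len == 0) ? 0u : (uint32_t)(0xFFFFFFFFu << (32 - prefix_len));
    return (a & mask) == (n & mask);
}
```

For the log excerpt above, 192.168.2.1 and 192.168.2.2 fall inside 192.168.2.0/24 while 192.168.109.1 does not, so restricting the TCP btl to that network would leave out the address whose connect() timed out.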

Related

OpenMPI MPI_Send vs Intel MPI MPI_Send

I have code that I compile and run using Open MPI. Lately, I wanted to run this same code using Intel MPI, but it is not working as expected.
I dug into the code and found that MPI_Send behaves differently in the two implementations.
On a different forum I was advised to use MPI_Isend instead of MPI_Send, but that would require a lot of work to modify the code. Is there any workaround in Intel MPI to make it behave just like Open MPI? Maybe some flags, a larger buffer, or something else? Thanks in advance for your answers.
int main(int argc, char **argv) {
    int numRanks;
    int rank;
    char cmd[] = "Hello world";
    MPI_Status status;
    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &numRanks);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    if(rank == 0) {
        for (int i=0; i< numRanks; i++) {
            printf("Calling MPI_Send() from rank %d to %d\n", rank, i);
            MPI_Send(&cmd,sizeof(cmd),MPI_CHAR,i,MPI_TAG,MPI_COMM_WORLD);
            printf("Returned from MPI_Send()\n");
        }
    }
    MPI_Recv(&cmd,sizeof(cmd),MPI_CHAR,0,MPI_TAG,MPI_COMM_WORLD,&status);
    printf("%d receieved from 0 %s\n", rank, cmd);
    MPI_Finalize();
}
OpenMPI Result
# mpirun --allow-run-as-root -n 2 helloworld_openmpi
Calling MPI_Send() from rank 0 to 0
Returned from MPI_Send()
Calling MPI_Send() from rank 0 to 1
Returned from MPI_Send()
0 receieved from 0 Hello world
1 receieved from 0 Hello world
Intel MPI Result
# mpiexec.hydra -n 2 /root/helloworld_intel
Calling MPI_Send() from rank 0 to 0
Stuck at MPI_Send.

It is incorrect to assume MPI_Send() will return before a matching receive is posted, so your code is incorrect with respect to the MPI Standard, and you are lucky it did not hang with Open MPI.
MPI implementations usually send small messages eagerly, so MPI_Send() can return immediately, but this is an implementation choice not mandated by the standard, and what counts as a "small" message depends on the library version, the interconnect you are using, and other factors.
The only safe and portable choice here is to write correct code.
FWIW, MPI_Bcast(cmd, ...) is a better fit here, assuming all ranks already know the string length plus the NUL terminator.
Last but not least, the buffer argument is cmd and not &cmd.

MPI code does not work with 2 nodes, but with 1

Super EDIT:
Adding the broadcast step results in ncols being printed by both processes on the master node (the one whose output I can check). But why? All the variables that are broadcast already have a value on the line where they are declared!
I have some code based on this example.
I checked that the cluster configuration is OK with this simple program, which also printed the IP of the machine it was running on:
#include <mpi.h>
#include <stdio.h>
int main (int argc, char *argv[])
{
    int rank, size;
    MPI_Init (&argc, &argv); /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
    printf( "Hello world from process %d of %d\n", rank, size );
    // removed code that printed IP address
    MPI_Finalize();
    return 0;
}
which printed every machine's IP twice.
EDIT_2
If I print (only) the grid, like in the example, I am getting for one computer:
Processes grid pattern:
0 1
2 3
and for two:
Processes grid pattern:

The executables were not the same on both nodes!
When I configured the cluster I had a very hard time, so when I ran into a problem with mounting, I just skipped it. As a result, changes appeared on only one node, and the code behaved weirdly unless some (or all) of it happened to be the same on both nodes.

Strange behavior when mixing openMP with openMPI

I have some code that is parallelized using OpenMP (on a for loop). I now want to repeat the functionality several times and use MPI to submit to a cluster of machines, while keeping all the intra-node work in OpenMP.
When I use only OpenMP, I get the speedup I expect (using twice the number of processors/cores finishes in half the time). When I add MPI and submit with only one MPI process, I do not get this speedup. I created a toy problem to check this and still have the same issue. Here is the code:
#include <iostream>
#include <stdio.h>
#include <unistd.h>
#include "mpi.h"
#include <omp.h>
int main(int argc, char *argv[]) {
    int iam=0, np = 1;
    long i;
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    double t1 = MPI_Wtime();
    std::cout << "!!!Hello World!!!" << std::endl; // prints !!!Hello World!!!
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    int nThread = omp_get_num_procs();//omp_get_num_threads here returns 1??
    printf("nThread = %d\n", nThread);
    int *total = new int[nThread];
    for (int j=0;j<nThread;j++) {
        total[j]=0;
    }
    #pragma omp parallel num_threads(nThread) default(shared) private(iam, i)
    {
        np = omp_get_num_threads();
        #pragma omp for schedule(dynamic, 1)
        for (i=0; i<10000000; i++) {
            iam = omp_get_thread_num();
            total[iam]++;
        }
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs,processor_name);
    }
    int grandTotal=0;
    for (int j=0;j<nThread;j++) {
        printf("Total=%d\n",total[j]);
        grandTotal += total[j];
    }
    printf("GrandTotal= %d\n", grandTotal);
    MPI_Finalize();
    double t2 = MPI_Wtime();
    printf("time elapsed with MPI clock=%f\n", t2-t1);
    return 0;
}
I am compiling with openmpi-1.8/bin/mpic++, using the -fopenmp flag. Here is my PBS script
#PBS -l select=1:ncpus=12
setenv OMP_NUM_THREADS 12
/util/mpi/openmpi-1.8/bin/mpirun -np 1 -hostfile $PBS_NODEFILE --map-by node:pe=$OMP_NUM_THREADS /workspace/HelloWorldMPI/HelloWorldMPI
I have also tried #PBS -l nodes=1:ppn=12 and get the same results.
When using half the cores, the program is actually faster (twice as fast!). When I reduce the number of cores, I change both ncpus and OMP_NUM_THREADS. I have tried increasing the actual work (adding 10^10 numbers instead of the 10^7 shown here in the code). I have tried removing the printf statements, wondering if they were somehow slowing things down, but I still have the same problem. top shows that I am using all the CPUs (as set in ncpus) at close to 100%. If I submit with -np=2, it parallelizes beautifully on two machines, so the MPI seems to be working as expected, but the OpenMP is broken.
I am out of ideas now. Is there anything I can try? What am I doing wrong?

I hate to say it, but there's a lot wrong and you should probably just familiarize
yourself more with OpenMP and MPI. Nevertheless, I'll try to go through your code
and point out the errors I saw.
double t1 = MPI_Wtime();
Starting out: Calling MPI_Wtime() before MPI_Init() is undefined. I'll also add that if you
want to do this sort of benchmark with MPI, a good idea is to put an MPI_Barrier() before
the call to MPI_Wtime() so that all the tasks enter the section at the same time.
//omp_get_num_threads here returns 1??
The reason why omp_get_num_threads() returns 1 is that you are not in a
parallel region.
#pragma omp parallel num_threads(nThread)
You set num_threads to nThread here, which, as Hristo Iliev mentioned, effectively
ignores any input through the OMP_NUM_THREADS environment variable. You can usually just
leave num_threads out and be OK for this sort of simplified problem.
default(shared)
The behavior of variables in the parallel region is by default shared, so there's
no reason to have default(shared) here.
private(iam, i)
I guess it's your coding style, but instead of making iam and i private, you could
just declare them within the parallel region, which will automatically make them private
(and considering you don't really use them outside of it, there's not much reason not to).
#pragma omp for schedule(dynamic, 1)
Also as Hristo Iliev mentioned, using schedule(dynamic, 1) for this problem set in particular
is not the best of ideas, since each iteration of your loop takes virtually no time
and the total problem size is fixed.
int grandTotal=0;
for (int j=0;j<nThread;j++) {
printf("Total=%d\n",total[j]);
grandTotal += total[j];
}
Not necessarily an error, but your allocation of the total array and summation at the end
is better accomplished using the OpenMP reduction clause.
double t2 = MPI_Wtime();
Similar to what you did with MPI_Init(), calling MPI_Wtime() after you've
called MPI_Finalize() is undefined, and should be avoided if possible.
Note: If you are somewhat familiar with what OpenMP is, this
is a good reference and basically everything I explained here about OpenMP is in there.
With that out of the way, I have to note you didn't actually do anything with MPI,
besides output the rank and comm size. Which is to say, all the MPI tasks
do a fixed amount of work each, regardless of the number of tasks. Since there's
no decrease in work-per-task for an increasing number of MPI tasks, you wouldn't expect
to have any scaling, would you? (Note: this is actually what's called Weak Scaling, but since you have no communication via MPI, there's no reason to expect it to not
scale perfectly).
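To see a speedup from adding MPI tasks, the total amount of work has to stay fixed, so each task takes only its share of the iterations. As a minimal sketch of that idea (not code from this answer), the hypothetical helper block_range splits n items across size tasks and hands the remainder to the first few ranks; the names and the half-open range convention are assumptions made for illustration.

```c
#include <assert.h>

/* Compute the half-open range [*begin, *end) of the n work items that
 * task `rank` out of `size` tasks should handle. The first n % size
 * tasks get one extra item, so no item is dropped when size does not
 * divide n evenly. */
void block_range(long n, int size, int rank, long *begin, long *end)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    long base = n / size;   /* items every task gets */
    long extra = n % size;  /* leftover items, one each for the first ranks */
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}
```

In a real program rank and size would come from MPI_Comm_rank and MPI_Comm_size. A plain integer division such as 1e7 / world_size gives the same result whenever the task count divides the work evenly; the helper only adds handling of the remainder.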
Here's your code rewritten with some of the changes I mentioned:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <omp.h>
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int world_size,
        world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int name_len;
    char proc_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Get_processor_name(proc_name, &name_len);
    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();
    // we need to scale the work per task by the number of MPI tasks,
    // otherwise we actually do more work with the more tasks we have
    const int n_iterations = 1e7 / world_size;
    // actually we also need some dummy data to add so the compiler doesn't just
    // optimize out the work loop with -O3 on
    int data[16];
    for (int i = 0; i < 16; ++i)
        data[i] = rand() % 16;
    // reduction(+:total) means that all threads will make a private
    // copy of total at the beginning of this construct and then
    // do a reduction operation with the + operator at the end (aka sum them
    // all together)
    unsigned int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        // both of these calls will execute properly since we
        // are in an omp parallel region
        int n_threads = omp_get_num_threads(),
            thread_id = omp_get_thread_num();
        // note: this code will only execute on a single thread (per mpi task)
        #pragma omp master
        {
            printf("nThread = %d\n", n_threads);
        }
        #pragma omp for
        for (int i = 0; i < n_iterations; i++)
            total += data[i % 16];
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               thread_id, n_threads, world_rank, world_size, proc_name);
    }
    // do a reduction with MPI, otherwise the data we just calculated is useless
    unsigned int grand_total;
    MPI_Allreduce(&total, &grand_total, 1, MPI_UNSIGNED, MPI_SUM, MPI_COMM_WORLD);
    // another barrier to make sure we wait for the slowest task
    MPI_Barrier(MPI_COMM_WORLD);
    double t_end = MPI_Wtime();
    // output the total of each MPI task (already summed over its threads)
    printf("Task total = %u\n", total);
    // output results from a single task (rank 0)
    if (world_rank == 0)
    {
        printf("Grand Total = %u\n", grand_total);
        printf("Time elapsed with MPI clock = %f\n", t_end - t_start);
    }
    MPI_Finalize();
    return 0;
}
Another thing to note, my version of your code executed 22 times slower with schedule(dynamic, 1) added, just to show you how it can impact performance when used incorrectly.
Unfortunately I'm not too familiar with PBS, as the clusters I use run SLURM, but an example sbatch file for a job running on 3 nodes, on a system with two 6-core processors per node, might look something like this:
#!/bin/bash
#SBATCH --job-name=TestOrSomething
#SBATCH --export=ALL
#SBATCH --partition=debug
#SBATCH --nodes=3
#SBATCH --ntasks-per-socket=1
# set 6 threads per process here
export OMP_NUM_THREADS=6
# note that this will end up running 3 * (however many cpus
# are on a single node) mpi tasks, not just 3. Additionally
# the below line might use `mpirun` instead depending on the
# cluster
srun ./a.out
For fun, I also just ran my version on a cluster to test the scaling for MPI and OMP, and got the following (note the log scales):
As you can see, it's basically perfect. Actually, 1-16 is 1 MPI task with 1-16 OMP threads, and 16-256 is 1-16 MPI tasks with 16 threads per task, so you can also see that there's no change in behavior between the MPI scaling and OMP scaling.

Simple MPI_Scatter try

I am just learning OpenMPI. Tried a simple MPI_Scatter example:
#include <iostream>
#include <mpi.h>
using namespace std;
int main() {
    int numProcs, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int* data;
    int num;
    data = new int[5];
    data[0] = 0;
    data[1] = 1;
    data[2] = 2;
    data[3] = 3;
    data[4] = 4;
    MPI_Scatter(data, 5, MPI_INT, &num, 5, MPI_INT, 0, MPI_COMM_WORLD);
    cout << rank << " received " << num << endl;
    MPI_Finalize();
    return 0;
}
But it didn't work as expected ...
I was expecting something like
0 received 0
1 received 1
2 received 2 ...
But what I got was
32609 received
1761637486 received
1 received
33 received
1601007716 received
What's with the weird ranks? It seems to have something to do with my scatter. Also, why are sendcount and recvcount the same? At first I thought that since I'm scattering 5 elements to 5 processors, each would get 1, so I should be using:
MPI_Scatter(data, 5, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
But this gives an error:
[JM:2861] *** An error occurred in MPI_Scatter
[JM:2861] *** on communicator MPI_COMM_WORLD
[JM:2861] *** MPI_ERR_TRUNCATE: message truncated
[JM:2861] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I am wondering, though, why do I need to differentiate between root and child processes? It seems that in this case the source/root will also get a copy? Another thing: will the other processes run the scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs:
if (rank == xxx) {
UPDATE
I noticed that, for it to run, the send and receive buffers must be of the same length, and the data should be declared like:
int data[5][5] = { {0}, {5}, {10}, {3}, {4} };
Notice that the columns are declared with length 5, but I only initialized 1 value? What is actually happening here? Is this code correct? Suppose I want each process to receive only 1 value.

sendcount is the number of elements you want to send to each process, not the count of elements in the send buffer. MPI_Scatter will just take sendcount * [number of processes in the communicator] elements from the send buffer of the root process and scatter them to all processes in the communicator.
So to send 1 element to each of the processes in the communicator (assume there are 5 processes), set sendcount and recvcount to be 1.
MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
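To make the counting rule concrete, here is a plain-C model of the data movement only. It makes no MPI calls, and scatter_model is a hypothetical name used for illustration: it copies the sendcount elements that a given rank would end up with out of a root buffer that holds sendcount * nprocs elements.

```c
#include <assert.h>
#include <string.h>

/* Plain-C model of what MPI_Scatter delivers to one receiver.
 * sendbuf is the root's buffer and must hold sendcount * nprocs ints;
 * recvbuf must have room for sendcount ints. */
void scatter_model(const int *sendbuf, int sendcount, int nprocs,
                   int rank, int *recvbuf)
{
    assert(sendcount >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    /* Rank r gets the r-th chunk of sendcount consecutive elements. */
    memcpy(recvbuf, sendbuf + (size_t)rank * (size_t)sendcount,
           (size_t)sendcount * sizeof(int));
}
```

With data = {0, 1, 2, 3, 4}, sendcount = 1 and 5 processes, rank 3 ends up with the single value 3. With sendcount = 5, the root buffer would need 25 elements and every receiver would need room for 5.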
There are restrictions on the possible datatype pairs and they are the same as for point-to-point operations. The type map of recvtype should be compatible with the type map of sendtype, i.e. they should have the same list of underlying basic datatypes. Also the receive buffer should be large enough to hold the received message (it might be larger, but not smaller). In most simple cases, the data type on both send and receive sides are the same. So sendcount - recvcount pair and sendtype - recvtype pair usually end up the same. An example where they can differ is when one uses user-defined datatype(s) on either side:
MPI_Datatype vec5int;
MPI_Type_contiguous(5, MPI_INT, &vec5int);
MPI_Type_commit(&vec5int);
MPI_Scatter(data, 5, MPI_INT, local_data, 1, vec5int, 0, MPI_COMM_WORLD);
This works since the sender constructs messages of 5 elements of type MPI_INT while each receiver interprets the message as a single instance of a 5-element integer vector.
(Note that in MPI_Recv you specify the maximum number of elements to be received, and the actual amount received might be less, which can be obtained with MPI_Get_count. In contrast, in recvcount of MPI_Scatter you supply the expected number of elements to be received, so an error will be raised if the length of the message received is not exactly as promised.)
You probably know by now that the weird rank printed out is caused by stack corruption, since num can hold only 1 int but 5 ints are received in MPI_Scatter.
I am wondering, though, why do I need to differentiate between root and child processes? It seems that in this case the source/root will also get a copy? Another thing: will the other processes run the scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if I see in MPI programs:
It is necessary to differentiate between the root and the other processes in the communicator (they are not child processes of the root, since they can be on a separate computer) in some operations such as Scatter and Gather, since these are collective (group) communication operations but with a single source/destination. The single source/destination (the odd one out) is therefore called the root. All the processes need to know the source/destination (root process) in order to set up the send and receive correctly.
The root process, in case of Scatter, will also receive a piece of data (from itself), and in case of Gather, will also include its data in the final result. There is no exception for the root process, unless "in place" operations are used. This also applies to all collective communication functions.
There are also root-less global communication operations like MPI_Allgather, where one does not provide a root rank. Rather all ranks receive the data being gathered.
All processes in the communicator will run the function (try to exclude one process in the communicator and you will get a deadlock). You can imagine processes on different computers running the same code blindly. However, since each of them may belong to a different communicator group and have a different rank, the function will run differently. Each process knows whether it is a member of the communicator, and each knows its own rank and can compare it to the rank of the root process (if any), so they can set up the communication or do extra actions accordingly.

MPI: How to start three functions which will be executed in different threads

I have 3 functions and 4 cores. I want to execute each function in a new thread using MPI and C++.
I wrote this:
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
size--;
if (rank == 0)
{
    Thread1();
}
else
{
    if(rank == 1)
    {
        Thread2();
    }
    else
    {
        Thread3();
    }
}
MPI_Finalize();
But it executes just Thread1(). How must I change the code?
Thanks!

Print to screen the current value of variable size (possibly without decrementing it) and you will find 1. That is: "there is 1 process running".
You are likely running your compiled code the wrong way. Consider using mpirun (or mpiexec, depending on your MPI implementation) to execute it, i.e.
mpirun -np 4 ./MyCompiledCode
the -np parameter specifies the number of processes you will start (doing so, your MPI_Comm_size will be 4 as you expect).
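Once several processes are running, each rank can pick its function from a table. This is a rough sketch rather than the corrected code mentioned in this answer: task1, task2, task3 and run_task_for_rank are hypothetical stand-ins for Thread1(), Thread2() and Thread3(), and the rank argument is assumed to come from MPI_Comm_rank in the real program. Unlike the if/else chain in the question, where every rank from 2 upward calls Thread3(), ranks without a task simply do nothing here.

```c
#include <assert.h>

typedef int (*task_fn)(void);

/* Hypothetical stand-ins for the question's Thread1(), Thread2(), Thread3().
 * Each returns its own number so the dispatch can be checked. */
static int task1(void) { return 1; }
static int task2(void) { return 2; }
static int task3(void) { return 3; }

/* Run the task assigned to `rank` and return its result.
 * Ranks beyond the last task have nothing to do and return 0. */
int run_task_for_rank(int rank)
{
    static const task_fn tasks[] = { task1, task2, task3 };
    const int n_tasks = (int)(sizeof tasks / sizeof tasks[0]);
    assert(rank >= 0);
    if (rank >= n_tasks)
        return 0;
    return tasks[rank]();
}
```

With 4 processes and 3 functions, ranks 0 to 2 each run one function and rank 3 stays idle.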
Currently, though, you are not using anything specific to C++. You could consider a C++ binding for MPI such as Boost.MPI.
I worked a little on the code you provided and changed it slightly, producing this working MPI code (I marked the needed corrections in capital letters).
FYI:
compilation (under gcc, mpich):
$ mpicxx -c mpi1.cpp
$ mpicxx -o mpi1 mpi1.o
execution
$ mpirun -np 4 ./mpi1
output
size is 4
size is 4
size is 4
2 function started.
thread2
3 function started.
thread3
3 function ended.
2 function ended.
size is 4
1 function started.
thread1
1 function ended.
Be aware that stdout from the different processes is likely to be jumbled.
Are you sure you are compiling your code the right way?

Your problem is that MPI provides no way to feed console input to many processes, only to the process with rank 0. Because of the first three lines in main:
int main(int argc, char *argv[]){
    int oper;
    std::cout << "Enter Size:";
    std::cin >> oper; // <------- The problem is right here
    Operations* operations = new Operations(oper);
    int rank, size;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    switch(tid)
    {
all processes but rank 0 block waiting for console input which they cannot get. You should rewrite the beginning of your main function as follows:
int main(int argc, char *argv[]){
    int oper;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    if (tid == 0) {
        std::cout << "Enter Size:";
        std::cin >> oper;
    }
    MPI_Bcast(&oper, 1, MPI_INT, 0, MPI_COMM_WORLD);
    Operations* operations = new Operations(oper);
    switch(tid)
    {
It works as follows: only rank 0 displays the prompt and then reads the console input into oper. Then a broadcast of the value of oper from rank 0 is performed so all other processes obtain the correct value, create the Operations object and then branch to the appropriate function.
