Simple MPI_Scatter attempt - C++

I am just learning Open MPI and tried a simple MPI_Scatter example:
#include <iostream>
#include <mpi.h>
using namespace std;

int main() {
    int numProcs, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* data;
    int num;

    data = new int[5];
    data[0] = 0;
    data[1] = 1;
    data[2] = 2;
    data[3] = 3;
    data[4] = 4;

    MPI_Scatter(data, 5, MPI_INT, &num, 5, MPI_INT, 0, MPI_COMM_WORLD);
    cout << rank << " received " << num << endl;

    MPI_Finalize();
    return 0;
}
But it didn't work as expected ...
I was expecting something like
0 received 0
1 received 1
2 received 2 ...
But what I got was
32609 received
1761637486 received
1 received
33 received
1601007716 received
What's with the weird ranks? It seems to have something to do with my scatter. Also, why are the sendcount and recvcount the same? At first I thought that since I'm scattering 5 elements to 5 processes, each would get 1, so I should be using:
MPI_Scatter(data, 5, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
But this gives an error:
[JM:2861] *** An error occurred in MPI_Scatter
[JM:2861] *** on communicator MPI_COMM_WORLD
[JM:2861] *** MPI_ERR_TRUNCATE: message truncated
[JM:2861] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing: will the other processes run the scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical guard I see in MPI programs:
if (rank == xxx) {
UPDATE
I noticed that, to make it run, the send and receive buffers must be of the same length ... and the data should be declared like:
int data[5][5] = { {0}, {5}, {10}, {3}, {4} };
Notice that each row is declared with length 5 but I only initialized 1 value per row. What is actually happening here? Is this code correct, supposing I only want each process to receive 1 value?

sendcount is the number of elements you want to send to each process, not the count of elements in the send buffer. MPI_Scatter will just take sendcount * [number of processes in the communicator] elements from the send buffer of the root process and scatter them to all processes in the communicator.
So to send 1 element to each of the processes in the communicator (assume there are 5 processes), set sendcount and recvcount to be 1.
MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
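For reference, a minimal corrected version of the program from the question could look like this (a sketch; it assumes the job is launched with at most 5 processes so the root's 5-element buffer is large enough):
#include <iostream>
#include <mpi.h>
using namespace std;

int main() {
    int numProcs, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[5] = {0, 1, 2, 3, 4};   // only the root's buffer is actually read
    int num = -1;

    // each process receives exactly one int from the root
    MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
    cout << rank << " received " << num << endl;

    MPI_Finalize();
    return 0;
}
Run with mpirun -np 5, each rank should then print the single value it received.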
There are restrictions on the possible datatype pairs and they are the same as for point-to-point operations: the type map of recvtype should be compatible with the type map of sendtype, i.e. they should have the same list of underlying basic datatypes, and the receive buffer should be large enough to hold the received message (it might be larger, but not smaller). In most simple cases the data type on the send and receive sides is the same, so the sendcount - recvcount pair and the sendtype - recvtype pair usually end up the same. An example where they can differ is when one uses user-defined datatype(s) on either side:
MPI_Datatype vec5int;
MPI_Type_contiguous(5, MPI_INT, &vec5int);
MPI_Type_commit(&vec5int);
MPI_Scatter(data, 5, MPI_INT, local_data, 1, vec5int, 0, MPI_COMM_WORLD);
This works since the sender constructs messages of 5 elements of type MPI_INT while each receiver interprets the message as a single instance of a 5-element integer vector.
(Note that in MPI_Recv you specify the maximum number of elements to be received, and the actual amount received, which might be less, can be obtained with MPI_Get_count. In contrast, in the recvcount argument of MPI_Scatter you supply the exact number of elements to be received, so an error will be thrown if the length of the received message is not exactly as promised.)
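For illustration, a small sketch of that MPI_Recv behaviour (the buffer size of 100 and the source/tag values are arbitrary):
int buf[100];
MPI_Status status;
int count;

// receive *up to* 100 ints; the sender may have sent fewer
MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &count);   // how many ints actually arrived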
Probably you know by now that the weird ranks printed out are caused by stack corruption: num can only contain 1 int, but 5 ints are received in MPI_Scatter.
I am wondering, though, why do I need to differentiate between root and child processes? It seems like in this case the source/root will also get a copy? Another thing: will the other processes run the scatter too? Probably not, but why? I thought all processes would run this code, since it's not inside the typical if (rank == xxx) guard I see in MPI programs.
It is necessary to differentiate between the root and the other processes in the communicator (they are not child processes of the root, since they can be on separate computers) in operations such as Scatter and Gather, because these are collective (group) communications but with a single source/destination. That single source/destination (the odd one out) is therefore called the root. All the processes need to know the source/destination (the root process) in order to set up the sends and receives correctly.
The root process, in the case of Scatter, will also receive a piece of the data (from itself), and in the case of Gather will also include its data in the final result. There is no exception for the root process unless "in place" operations are used. This applies to all collective communication functions.
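As an aside, a sketch of what such an "in place" scatter could look like (only to illustrate the exception mentioned above; the buffer names follow the question):
if (rank == 0) {
    // the root keeps its own chunk where it already sits inside data[]
    MPI_Scatter(data, 1, MPI_INT, MPI_IN_PLACE, 1, MPI_INT, 0, MPI_COMM_WORLD);
} else {
    // non-root ranks receive normally; their send arguments are ignored
    MPI_Scatter(NULL, 0, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
}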
There are also root-less global communication operations like MPI_Allgather, where one does not provide a root rank. Rather all ranks receive the data being gathered.
All processes in the communicator will run the function (try to exclude one process in the communicator and you will get a deadlock). You can imagine processes on different computers running the same code blindly. However, since each of them may belong to a different communicator group and has a different rank, the function will run differently. Each process knows whether it is a member of the communicator, and each knows its own rank and can compare it to the rank of the root process (if any), so they can set up the communication or take extra actions accordingly.

Related

Passing a large 2-dimensional array in MPI C++

I have a task to speed up a program using MPI.
Let's assume I have a large 2D array (1000x1000 or bigger) as input. I have a working sequential program that divides the 2D array into chunks (for example 10x10) and calculates a result, which is a double, for each chunk (so we have a function whose argument is a 10x10 2D array and whose result is a double).
My first idea to speed it up:
Create a 1D array of size N*N (for example 10x10 = 100) and send the array to another process:
double* buffer = new double[dataPortionSize];
//copy some data to buffer
MPI_Send(buffer, dataPortionSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
Receive it in another process, calculate the result, and send the result back:
double* buf = new double[dataPortionSize];
MPI_Recv(buf, dataPortionSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
double result = function->calc(buf);
MPI_Send(&result, 1, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);
This program was much slower than the sequential version. It looks like MPI needs a lot of time to pass an array to another process.
My second idea:
Pass the whole 2D input array to all processes:
// data is protected field in base class, it is injected during runtime
MPI_Send(&(data[0][0]), dataSize * dataSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
And receive the data like this:
double **arrayAlloc( int size ) {
    double **result = new double*[ size ];   // array of row pointers, each row allocated separately
    for ( int i = 0; i < size; i++ )
        result[ i ] = new double[ size ];
    return result;
}

double **data = arrayAlloc(dataSize);
MPI_Recv(&data[0][0], dataSize * dataSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
Unfortunately, I got a bunch of errors during execution. The crashes are pretty random; it happened twice that the program ended successfully.
My third idea:
Pass a memory address to all processes, but I found this:
MPI processes cannot read each others' memory, and virtual addressing makes one process' pointer completely meaningless to another.
Does anyone have an idea how to speed it up? I understand that the key to speed is passing the array/arrays between processes efficiently, but I don't know how to do this.
You have multiple issues here. I'll try to go through them in some arbitrary order.
As someone else explained, your second attempt fails because MPI expects you to work with a single contiguous array, not an array of pointers. So you want to allocate something like matrix = new double[rows * cols] and then access individual rows as &matrix[row * cols] or an individual value as matrix[row * cols + col].
This would be a data structure that you can send, receive, scatter, and gather with MPI. It would also be faster in general.
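A sketch of that layout (rows, cols and the destination/tag values here are placeholders, not taken from your code):
const int rows = 1000, cols = 1000;

// one contiguous block holding the whole matrix in row-major order
double* matrix = new double[rows * cols];

// element (i, j) and the start of row i are reached with index arithmetic
int i = 2, j = 3;
matrix[i * cols + j] = 42.0;
double* row_i = &matrix[i * cols];

// a full row, several rows, or the whole matrix is one contiguous message
MPI_Send(row_i, cols, MPI_DOUBLE, /*dest*/ 1, /*tag*/ 0, MPI_COMM_WORLD);
MPI_Send(matrix, rows * cols, MPI_DOUBLE, /*dest*/ 1, /*tag*/ 0, MPI_COMM_WORLD);

delete[] matrix;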
You are correct to assume that MPI takes time to transfer data. Even in the best case it is the cost of a memcpy, and usually significantly more. If your program does too little work before transferring data, it will not be faster.
Your first attempt may have failed because the first process doesn't do anything useful while waiting for the result. You didn't include the receive operation in your code sample. However, if you wrote something like this:
for(int block = 0; block < nblocks; ++block) {
    generate_data(buf);
    MPI_Send(buf, ...);
    MPI_Recv(buf, ...);
}
Then you cannot expect a speedup because the process is not doing anything useful while waiting for the result. You can avoid this with double buffering. Let the first process generate the next data block before waiting in the receive operation for the result. Something like this:
generate_data(0, input);         /* 0-th block */
MPI_Send(input, ...);
for(int block = 1; block < nblocks; ++block) {
    generate_data(block, input); /* 1st up to nth block */
    MPI_Recv(output, ...);       /* 0-th up to n-1-th block */
    MPI_Send(input, ...);
}
MPI_Recv(output, ...);           /* n-th block */
Now calculations in both processes can overlap.
You shouldn't use MPI_Send and MPI_Recv for this to begin with! MPI provides collective operations like MPI_Scatter and MPI_Gather. What you should do is generate N blocks for N processes, MPI_Scatter them across all processes, let each process compute its result, and then MPI_Gather the results back at the root process.
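A sketch of that scatter/compute/gather pattern (the block size, the calc() placeholder and the buffer names are assumptions, not taken from your code; rank and worldsize are obtained as in the loop example further down):
#include <vector>

const int blockSize = 100;                      // doubles handed to each process
std::vector<double> all;                        // only filled on the root
std::vector<double> block(blockSize);

if (rank == 0) {
    all.resize(blockSize * worldsize);
    // ... fill 'all' with one block of input per process ...
}

// hand every process its own block
MPI_Scatter(all.data(), blockSize, MPI_DOUBLE,
            block.data(), blockSize, MPI_DOUBLE, 0, MPI_COMM_WORLD);

double myResult = calc(block.data());           // placeholder for the real computation

// collect one double per process back at the root
std::vector<double> results(rank == 0 ? worldsize : 0);
MPI_Gather(&myResult, 1, MPI_DOUBLE,
           results.data(), 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);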
Even better, let every process work independently, if possible. Of course this depends on your data but if you can generate and process data blocks independently from one another, don't do any communication. Just let them all work alone. Something like this:
int rank, worldsize;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &worldsize);

for(int block = rank; block < nblocks; block += worldsize) {
    process_data(block);
}

MPI dynamically allocate tasks

I have a C++ MPI program that runs on a Windows HPC cluster (12 nodes, 24 cores per node).
The logic of the program is really simple:
There is a pool of tasks.
At the start, the program divides the tasks equally among the MPI processes.
Each MPI process executes its tasks.
After everything is finished, MPI reduce is used to gather the results to the root process.
There is one problem. Each task can have a drastically different execution time, and there is no way I can tell that in advance. Equally distributing the tasks results in a lot of processes waiting idle. This wastes a lot of compute resources and makes the total execution time longer.
I am thinking of one solution that might work.
The process is like this.
The task pool is divided into small parcels (say 10 tasks per parcel).
Each MPI process takes a parcel whenever it is idle (has not received a parcel yet, or has finished the previous one).
Step 2 is repeated until the task pool is exhausted.
MPI reduce is used to gather all the results to the root process.
As far as I understand, this scheme needs a universal counter across nodes/processes (to prevent different MPI processes from executing the same parcel), and changing it needs some lock/sync mechanism. It certainly has its overhead, but with proper tuning I think it can help to improve performance.
I am not quite familiar with MPI and have some implementation questions. I can think of two ways to implement this universal counter:
Using the MPI I/O technique: write the counter to a file and, when a parcel is taken, increase the counter (this will certainly need a file locking mechanism).
Using MPI one-sided communication/shared memory: put the counter in shared memory and increase it when a parcel is taken (this will certainly need a sync mechanism).
Unfortunately, I am not familiar with either technique and want to explore the possibilities, implementations, and possible drawbacks of the two methods above. Sample code would be greatly appreciated.
If you have other ways to solve the problem, or suggestions, that would also be great. Thanks.
Follow-up:
Thanks for all the useful suggestions. I have implemented a test program following the scheme of using process 0 as the task distributor.
#include <iostream>
#include <mpi.h>
using namespace std;

void doTask(int rank, int i){
    cout << rank << " got task " << i << endl;
}

int main ()
{
    int numTasks = 5000;
    int parcelSize = 100;
    int numParcels = (numTasks/parcelSize) + (numTasks%parcelSize==0 ? 0 : 1);
    //cout << numParcels << endl;

    MPI_Init(NULL, NULL);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    MPI_Status status;
    MPI_Request request;
    int ready = 0;
    int i = 0;
    int maxParcelNow = 0;

    if(rank == 0){
        for(i = 0; i < numParcels; i++){
            MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            //cout << i << "Yes" << endl;
            MPI_Send(&i, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
            //cout << i << "No" << endl;
        }
        maxParcelNow = i;
        cout << maxParcelNow << " " << numParcels << endl;
    }else{
        int counter = 0;
        while(true){
            if(maxParcelNow == numParcels) {
                cout << "Yes exiting" << endl;
                break;
            }
            //if(maxParcelNow == numParcels - 1) break;
            ready = 1;
            MPI_Send(&ready, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            //cout << rank << "send" << endl;
            MPI_Recv(&i, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            //cout << rank << "recv" << endl;
            doTask(rank, i);
        }
    }

    MPI_Bcast(&maxParcelNow, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
It does not work and never stops. Any suggestions on how to make it work? Does this code reflect the idea correctly, or am I missing something? Thanks.
[Converting my comments into an answer...]
Given n processes, you can have your first process, p0, dispatch tasks to the other n - 1 processes. First it will do point-to-point communication to the other n - 1 processes so that everyone has work to do, and then it will block on a Recv. When any given process completes, say p3, it sends its result back to p0. At that point, p0 sends another message to p3 with one of two things:
1) Another task
or
2) Some kind of termination signal if there are no tasks remaining. (Using the 'tag' of the message is one easy way to do this.)
Obviously, p0 will loop over that logic until there are no tasks left, at which point it will call MPI_Finalize too.
Unlike what you thought in your comments, this isn't round-robin. It first gives a job to every process, or worker, and then gives back another job whenever one completes...
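To make that concrete, here is a sketch of such a dispatcher, reusing the names from your test program (numParcels, nproc, rank, doTask); the tag values and the dummy result variable are placeholders:
const int TAG_WORK = 0, TAG_STOP = 1;

if (rank == 0) {                              // rank 0 only dispatches parcels
    int next = 0, active = 0;
    for (int w = 1; w < nproc; ++w) {         // seed every worker with one parcel
        if (next < numParcels) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            ++next;
            ++active;
        } else {                              // fewer parcels than workers
            int dummy = -1;
            MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        }
    }
    while (active > 0) {                      // hand out the rest on demand
        int result;
        MPI_Status status;
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (next < numParcels) {
            MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            ++next;
        } else {                              // nothing left: tell this worker to stop
            int dummy = -1;
            MPI_Send(&dummy, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            --active;
        }
    }
} else {                                      // workers
    while (true) {
        int parcel;
        MPI_Status status;
        MPI_Recv(&parcel, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == TAG_STOP)
            break;
        doTask(rank, parcel);                 // process the parcel
        int result = parcel;                  // stand-in for the real result
        MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
    }
}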

MPI: exchanging sides of vectors between processes

In my code I have an arbitrary number of processes exchanging some parts of their local vectors. The local vectors are vectors of pairs, and for this reason I've been using an MPI derived datatype. In principle I don't know how many elements each process sends to the others, so I also have to send the size of the buffer. In particular, each process exchanges data with the process of rank myrank-1 and with the process of rank myrank+1. Process 0, instead of exchanging with myrank-1, exchanges with the process of rank comm_size-1. Likewise, process comm_size-1, instead of exchanging with myrank+1, exchanges with the process of rank 0.
This is my code:
unsigned int size1tobesent;
size1tobesent = last.size();     // buffer size

int slpartner = (rank + 1) % p;
int rlpartner = (rank - 1 + p) % p;

unsigned int sizereceived1;
MPI_Sendrecv(&size1tobesent, 1, MPI_UNSIGNED, slpartner, 0,
             &sizereceived1, 1, MPI_UNSIGNED, rlpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Vect first1(sizereceived1);
MPI_Sendrecv(&last[0], last.size(), mytype, slpartner, 0,
             &first1[0], sizereceived1, mytype, rlpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

unsigned int size2tobesent;
size2tobesent = first.size();    // buffer size 2

unsigned int sizereceived2;
MPI_Sendrecv(&size2tobesent, 1, MPI_UNSIGNED, rlpartner, 0,
             &sizereceived2, 1, MPI_UNSIGNED, slpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Vect last1(sizereceived2);
MPI_Sendrecv(&first[0], first.size(), mytype, rlpartner, 0,
             &last1[0], sizereceived2, mytype, slpartner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Now, when I run my code with 2 or 3 processes, everything works as expected. With more than 3 the results are unpredictable. I don't know if this is due to a particular combination of the input data or if there is some theoretical error I'm missing.
Finally, consider that this code is part of a for loop.

How to use shared global datasets in MPI?

I am an MPI beginner. I have a large array gmat of numbers (type double, dimensions 1 x 14000000) which is precomputed and stored in a binary file. It will use approximately 100 MB in memory (14000000 x 8 bytes / 1024 / 1024). I want to write MPI code which will do some computation on this array (for example, multiply all elements of gmat by the rank number of the process). The array gmat itself stays constant during run time.
The code is supposed to be something like this:
#include <iostream>
#include "mpi.h"

double* gmat;
long int imax;

int main(int argc, char* argv[])
{
    void performcomputation(int rank); // this function performs the computation and will be called by all processes

    imax = atoi(argv[1]); // user inputs the length of gmat

    MPI::Init();
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size(); // I will use -np 16 = 4 processors x 4 cores

    if rank==0 // read the gmat array using one of the processes
    {
        gmat = new double[imax];
        // read values of gmat from a file
        // next line is supposed to broadcast values of gmat to all processes which will use it
        MPI::COMM_WORLD.Bcast(&gmat, imax, MPI::DOUBLE, 1);
    }

    MPI::COMM_WORLD.Barrier();
    performcomputation(rank);
    MPI::Finalize();
    return 0;
}

void performcomputation(int rank)
{
    int i;
    for (i = 0; i < imax; i++)
        cout << "the new value is" << gmat[i]*rank << endl;
}
My question is: when I run this code using 16 processes (-np 16), is gmat the same for all of them? I mean, will the code use 16 x 100 MB in memory to store gmat for each process, or will it use only 100 MB since I have defined gmat as global? And I don't want different processes to read gmat separately from the file, since reading so many numbers takes time. What is a better way to do this? Thanks.
First of all, please do not use the MPI C++ bindings. These were deprecated in MPI-2.2 and then deleted in MPI-3.0, and are therefore no longer part of the specification, which means that future MPI implementations are not even required to provide C++ bindings, and if they do, they will probably diverge in what the interface looks like.
That said, your code contains a very common mistake:
if rank==0 // read the gmat array using one of the processes
{
    gmat = new double[imax];
    // read values of gmat from a file
    // next line is supposed to broadcast values of gmat to all processes which will use it
    MPI::COMM_WORLD.Bcast(&gmat, imax, MPI::DOUBLE, 1);
}
This won't work, as there are four errors here. First, gmat is only allocated at rank 0 and not in the other ranks, which is not what you want. Second, you are giving Bcast the address of the pointer gmat and not the address of the data it points to (i.e. you should not use the & operator). Third, you broadcast from rank 0 but pass 1 as the broadcast root argument. But the most important error is that MPI_BCAST is a collective communication call, and all ranks are required to call it with the same value of the root argument in order for it to complete successfully. The correct code (using the C bindings instead of the C++ ones) is:
gmat = new double[imax];
if (rank == 0)
{
    // read values of gmat from a file
}
MPI_Bcast(gmat, imax, MPI_DOUBLE, 0, MPI_COMM_WORLD);
//        ^^^^                    ^
//        no &                    root == 0
Each rank has its own copy of gmat. Initially all the values are different (e.g. random garbage or all zeros, depending on the memory allocator). After the broadcast, all copies become identical to the copy of gmat at rank 0. After the call to performcomputation() each copy will be different again, since each rank multiplies the elements of gmat by a different number. The answer to your question is: the code will use 100 MiB in each rank, hence 16 x 100 MiB in total.
MPI deals with distributed memory - processes do not share variables, no matter if they are local or global ones. The only way to share data is to use MPI calls like point-to-point communication (e.g. MPI_SEND / MPI_RECV), collective calls (e.g. MPI_BCAST) or one-sided communication (e.g. MPI_PUT / MPI_GET).
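Putting it all together, a minimal sketch of the whole program with the C bindings could look like this (the actual file reading is left out and the command-line handling is unchecked):
#include <iostream>
#include <cstdlib>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int imax = std::atoi(argv[1]);       // user supplies the length of gmat
    double* gmat = new double[imax];     // every rank allocates its own copy

    if (rank == 0) {
        // read the values of gmat from the file here (rank 0 only)
    }

    // every rank calls the broadcast; afterwards all copies match rank 0's
    MPI_Bcast(gmat, imax, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // each rank now works on its own (identical) copy of gmat
    for (int i = 0; i < imax; i++)
        std::cout << "the new value is " << gmat[i] * rank << std::endl;

    delete[] gmat;
    MPI_Finalize();
    return 0;
}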

Difficulty with MPI_Bcast: how to ensure that "correct" root is broadcasting

I am relatively new to MPI (with C), and am having some trouble using MPI_Bcast to send an int to all processes.
In my code, I decide which rank is the root within a for loop, where different processes are responsible for different elements of the loop. Then I want to Bcast a result from the root to all processes, except that the non-root processes do not know whom to expect the bcast from, so they do not receive it.
The code block looks something like this:
for (iX = start; iX < end; iX++)
    // start and end are the starting row and ending row for each process, defined earlier
    for (int iY = 1; iY < nn; iY++)
        // do some calculations
        if (some condition)
            int bcastroot = rank; // initialized above
            int X = 111;          // initialized above
        else
            // do other calculations
        end
    end

MPI_Bcast(&X, 1, MPI_INT, bcastroot, comm);
// remainder of code and MPI_FINALIZE
When I execute this code, the default value of bcastroot (the value on all non-root processes) competes with the root's value, so X is not broadcast correctly. I do not know the value of X, nor can I predict the root beforehand, so I cannot define it in advance.
I have tried initializing bcastroot = -1 and then setting it on the root rank, but this does not work. Is there a way I can Bcast this value without setting the root on all processes?
Thanks,
JonZor
There is no way to do an MPI_Bcast where the receivers don't know what the root is. If you know there will only be one root, you can first do an MPI_Allreduce to agree on it:
int root, maybe_root;
int i_am_root = ...;
maybe_root = (i_am_root ? rank : 0);
MPI_Allreduce(&maybe_root, &root, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
then every rank will know the same root and you can do your broadcast.
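For completeness, a sketch of the whole pattern (the condition selecting the root and the value 111 are placeholders standing in for your loop logic):
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int i_am_root = (rank == 2);   // placeholder for "some condition"; true on exactly one rank
int X = 0;
if (i_am_root)
    X = 111;                   // the value that only the root knows

// agree on the root: non-roots contribute 0, so MPI_MAX yields the root's rank
int root, maybe_root = i_am_root ? rank : 0;
MPI_Allreduce(&maybe_root, &root, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

// now every rank passes the same root, so the broadcast is well defined
MPI_Bcast(&X, 1, MPI_INT, root, MPI_COMM_WORLD);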