MPI virtual topology design - c++

I have been trying to create a star topology using MPI_Comm_split, but I seem to have an issue when I try to establish the links between all processes. The processes are all expected to link to p0 of MPI_COMM_WORLD. The problem is that I get a crash on the line
error=MPI_Intercomm_create( MPI_COMM_WORLD, 0, NEW_COMM, 0 ,create_tag, &INTERCOMM );
The error is: MPI_ERR_COMM: invalid communicator.
I have an idea of the cause, though I don't know how to fix it: it seems to be due to the call made by process zero, which doesn't belong to the new communicator (NEW_COMM). I have tried to guard this line with an if statement so that process 0 skips it, but that fails too, since MPI_Intercomm_create is a collective call.
Any suggestions would be appreciated.
#include <iostream>
#include "mpi.h"
using namespace std;
int main() {
    MPI_Comm NEW_COMM, INTERCOMM;
    MPI_Init(NULL, NULL);

    int world_rank, world_size, new_size, error;
    error = MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    error = MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int color = MPI_UNDEFINED;
    if (world_rank > 0)
        color = world_rank;
    error = MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &NEW_COMM);

    int new_rank;
    if (world_rank > 0) {
        error = MPI_Comm_rank(NEW_COMM, &new_rank);
        error = MPI_Comm_size(NEW_COMM, &new_size);
    }

    int create_tag = 99;
    error = MPI_Intercomm_create(MPI_COMM_WORLD, 0, NEW_COMM, 0, create_tag, &INTERCOMM);

    if (world_rank > 0)
        cout << " My Rank in WORLD = " << world_rank << " New rank = " << new_rank
             << " size of NEWCOMM = " << new_size << endl;
    else
        cout << " Am centre " << endl;

    MPI_Finalize();
    return 0;
}
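The crash itself has a direct cause: for rank 0 the split with color MPI_UNDEFINED returns MPI_COMM_NULL, and passing a null communicator to MPI_Intercomm_create raises MPI_ERR_COMM. Note also that the arguments are swapped relative to the standard's signature: the local group's communicator (NEW_COMM) should come first and the peer communicator (MPI_COMM_WORLD) third. Below is a minimal sketch of one way to make the call valid on every rank, assuming a single inter-communicator between the centre and all the spokes is enough (rather than one per spoke, which the per-rank colors above would create); it is an illustration, not the original poster's code.

#include <mpi.h>
#include <iostream>

int main() {
    MPI_Init(NULL, NULL);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Two groups: the centre (world rank 0) and the spokes (everybody else).
    // Every rank now gets a valid local communicator from the split, so the
    // later MPI_Intercomm_create call is valid on all of MPI_COMM_WORLD.
    int color = (world_rank == 0) ? 0 : 1;
    MPI_Comm local_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &local_comm);

    // The remote leader is given as a rank in the peer communicator
    // (MPI_COMM_WORLD): rank 1 leads the spokes, rank 0 leads the centre.
    // Run with at least two processes.
    int remote_leader = (world_rank == 0) ? 1 : 0;
    MPI_Comm intercomm;
    MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader,
                         99, &intercomm);

    if (world_rank == 0)
        std::cout << "Am centre" << std::endl;

    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}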

What about using an MPI topology instead? Something like this:
#include <mpi.h>
#include <iostream>
int main( int argc, char *argv[] ) {
MPI_Init( &argc, &argv );
int rank, size;
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
int indegree, outdegree, *sources, *sourceweights, *destinations, *destweights;
if ( rank == 0 ) { //centre of the star
indegree = outdegree = size - 1;
sources = new int[size - 1];
sourceweights = new int[size - 1];
destinations = new int[size - 1];
destweights = new int[size - 1];
for ( int i = 0; i < size - 1; i++ ) {
sources[i] = destinations[i] = i + 1;
sourceweights[i] = destweights[i] = 1;
}
}
else { // tips of the star
indegree = outdegree = 1;
sources = new int[1];
sourceweights = new int[1];
destinations = new int[1];
destweights = new int[1];
sources[0] = destinations[0] = 0;
sourceweights[0] = destweights[0] = 1;
}
MPI_Comm star;
MPI_Dist_graph_create_adjacent( MPI_COMM_WORLD, indegree, sources, sourceweights,
outdegree, destinations, destweights, MPI_INFO_NULL,
true, &star );
delete[] sources;
delete[] sourceweights;
delete[] destinations;
delete[] destweights;
int starrank;
MPI_Comm_rank( star, &starrank );
std::cout << "Process #" << rank << " of MPI_COMM_WORLD is process #" << starrank << " of the star\n";
MPI_Comm_free( &star);
MPI_Finalize();
return 0;
}
Is that the sort of thing you were after? If not, what is your communicator for?
EDIT: Explanation about MPI topologies
I wanted to clarify that, even if this graph communicator is presented as such, it is no different from MPI_COMM_WORLD in most respects. Notably, it comprises the whole set of MPI processes initially present in MPI_COMM_WORLD. Indeed, although its star shape has been defined and no link was declared between process #1 and process #2, for example, nothing prevents you from doing point-to-point communication between these two processes. Simply, by defining this graph topology, you give an indication of the sort of communication pattern your code will exhibit. You then ask the library to try to reorder the ranks on the physical nodes, so as to come up with a better match between the physical layout of your machine / network and the needs you expressed. This can be done internally by an algorithm minimising a cost function, using a simulated annealing method for example, but it is costly. Moreover, it supposes that the actual layout of the network is available somewhere for the library to use (which isn't the case most of the time). So at the end of the day, most of the time this placement optimisation phase is simply ignored and you end up with the same indexes as the ones you entered... I only know of some mesh / torus shaped network-based machines that actually perform the placement phase for MPI_Cart_create(), but maybe I'm out of date on that.
Anyway, the bottom line is that I understand you want to play with communicators for learning, but don't expect too much from them. The best thing to learn here is how to get the ones you want with the fewest and simplest possible calls, which I hope is what I proposed.
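That said, if you later do want to exploit the star structure, this kind of graph communicator is exactly what the MPI-3 neighbourhood collectives operate on. A small hedged sketch, assuming it is inserted before the MPI_Comm_free(&star) call in the program above and that <vector> is included (the payload is made up for illustration):

// Each process contributes its own rank; each process then receives one value
// per in-neighbour of the graph: the centre (rank 0) gets size-1 values, a tip gets 1.
int sendval = rank;
std::vector<int> from_neighbours(rank == 0 ? size - 1 : 1);
MPI_Neighbor_allgather(&sendval, 1, MPI_INT,
                       from_neighbours.data(), 1, MPI_INT, star);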

Related

MPI RMA from multiple threads

In my application I am implementing what I would call a "reduce on sparse vectors" via a TBB flow graph combined with MPI one-sided communication (RMA). The central piece of the algorithm looks as follows:
auto &reduce = m_g_R.add<function_node<ReductionJob, ReductionJob>>(
serial,
[=, &reduced_bi](ReductionJob rj) noexcept
{
const auto r = std::get<0>(rj);
auto *buffer = std::get<1>(rj)->data.data();
auto &mask = std::get<1>(rj)->mask;
if (m_R_comms[r] != MPI_COMM_NULL)
{
const size_t n = reduced_bi.dim(r);
MPI_Win win;
MPI_Win_create(
buffer,
r == mr ? n * sizeof(T) : 0,
sizeof(T),
MPI_INFO_NULL,
m_R_comms[r],
&win
);
if (n > 0 && r != mr)
{
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
size_t i = 0;
do
{
while (i < n && !mask[i]) ++i;
size_t base = i;
while (i < n && mask[i]) ++i;
if (i > base) MPI_Accumulate(
buffer + base, i - base, MpiType<T>::type,
0,
base, i - base, MpiType<T>::type,
MPI_SUM,
win
);
}
while (i < n);
MPI_Win_unlock(0, win);
}
MPI_Win_free(&win);
}
return rj;
}
);
This is executed for each rank r participating in the calculation, with reduced_bi.dim(r) specifying how many elements each rank owns. mr is the current rank, and the communicators are created in such a way that the target process is root for each of them. buffer is an array of T = double (typically), and mask is an std::vector<bool> identifying which elements are non-zero. The combination of loops splits the communication into chunks of non-zero elements.
This generally works fine and the results are correct, the same as with my previous implementation using MPI_Reduce. However, it seems to be crucial that the concurrency level for this node is set to serial, indicating that at most one parallel TBB task (and thus at most one thread) executes this code.
I would like to set it to unlimited to improve performance, and indeed it works fine that way on my laptop with small jobs, running with MPICH 3.4.1. On the cluster where I really want to run the computation, however, using OpenMPI 4.1.1, it runs for a while before crashing with a segfault and a backtrace involving a bunch of UCX functions.
I wonder now, is it not allowed to have multiple threads in parallel call RMA operations like this (and on my laptop it only works accidentally), or am I hitting a bug/limitation on the cluster? From the documentation I do not see directly that what I would like to do is not supported.
Of course, MPI is initialized with MPI_THREAD_MULTIPLE, and I stress again that the snippet as posted above works fine; only when I change serial to unlimited to enable concurrent execution do I hit the problem on the cluster.
In reply to Victor Eijkhout comment(s) below, here is a complete sample program that reproduces the issue. This runs fine on my laptop (tested specifically with mpirun -n 16), but it crashes on the cluster when I run it with 16 ranks (spread across 4 cluster nodes).
#include <iostream>
#include <vector>
#include <thread>
#include <mpi.h>
int main(void)
{
int requested = MPI_THREAD_MULTIPLE, provided;
MPI_Init_thread(nullptr, nullptr, requested, &provided);
if (provided != requested)
{
std::cerr << "Failed to initialize MPI with full thread support!"
<< std::endl;
exit(1);
}
int mr, nr;
MPI_Comm_rank(MPI_COMM_WORLD, &mr);
MPI_Comm_size(MPI_COMM_WORLD, &nr);
const size_t dim = 1024;
const size_t repeat = 100;
std::vector<double> send(dim, static_cast<double>(mr) + 1.0);
std::vector<double> recv(dim, 0.0);
MPI_Win win;
MPI_Win_create(
recv.data(),
recv.size() * sizeof(double),
sizeof(double),
MPI_INFO_NULL,
MPI_COMM_WORLD,
&win
);
std::vector<std::thread> threads;
for (size_t i = 0; i < repeat; ++i)
{
threads.clear();
threads.reserve(nr);
for (int r = 0; r < nr; ++r) if (r != mr)
{
threads.emplace_back([r, &send, &win]
{
MPI_Win_lock(MPI_LOCK_SHARED, r, 0, win);
for (size_t i = 0; i < dim; ++i) MPI_Accumulate(
send.data() + i, 1, MPI_DOUBLE,
r,
i, 1, MPI_DOUBLE,
MPI_SUM,
win
);
MPI_Win_unlock(r, win);
});
}
for (auto &t : threads) t.join();
MPI_Barrier(MPI_COMM_WORLD);
if (mr == 0) std::cout << recv.front() << std::endl;
}
MPI_Win_free(&win);
MPI_Finalize();
}
Note: I am intentionally using plain threads here to avoid unnecessary dependencies. It should be linked with -lpthread.
The specific error I get on the cluster is this, using OpenMPI 4.1.1:
*** An error occurred in MPI_Accumulate
*** reported by process [1829189442,11]
*** on win ucx window 3
*** MPI_ERR_RMA_SYNC: error executing rma sync
*** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
*** and potentially your MPI job)
Possible relevant parts from ompi_info:
Open MPI: 4.1.1
Open MPI repo revision: v4.1.1
Open MPI release date: Apr 24, 2021
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes)
It has been compiled with UCX/1.10.1.
The style in C++ is to put the * or & with the type, not the identifier. This is called out specifically near the beginning of Stroustrup’s first book, and is an intentional difference from C style.
Create -- Lock -- Unlock -- Free
⧺R.1 Manage resources automatically using resource handles and RAII (Resource Acquisition Is Initialization)
Use a wrapper class (one written specifically for this C API, a general-purpose resource-manager template, or a unique_ptr with a custom deleter) rather than explicit calls that must be matched up for correct behavior.
RAII is one of C++'s fundamental strengths, and using it will go a long way toward making your code less buggy and more maintainable over time.
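For instance, a minimal sketch of such wrappers for the window itself and for a passive-target epoch (written here for illustration, not taken from any existing library):

#include <mpi.h>

// Owns an MPI window: created in the constructor, freed in the destructor.
class Window {
    MPI_Win win_ = MPI_WIN_NULL;
public:
    Window(void *base, MPI_Aint size, int disp_unit, MPI_Comm comm) {
        MPI_Win_create(base, size, disp_unit, MPI_INFO_NULL, comm, &win_);
    }
    ~Window() { if (win_ != MPI_WIN_NULL) MPI_Win_free(&win_); }
    Window(const Window &) = delete;
    Window &operator=(const Window &) = delete;
    MPI_Win get() const { return win_; }
};

// Scoped passive-target epoch: lock in the constructor, unlock in the destructor.
class Epoch {
    MPI_Win win_;
    int rank_;
public:
    Epoch(MPI_Win win, int rank, int lock_type = MPI_LOCK_SHARED)
        : win_(win), rank_(rank) {
        MPI_Win_lock(lock_type, rank, 0, win);
    }
    ~Epoch() { MPI_Win_unlock(rank_, win_); }
    Epoch(const Epoch &) = delete;
    Epoch &operator=(const Epoch &) = delete;
};

With something along these lines, the Create/Lock/Unlock/Free pairs in the reduction node cannot be left unmatched on an early return or an exception.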
Use destructuring syntax.
const auto r = std::get<0>(rj);
auto *buffer = std::get<1>(rj)->data.data();
auto &mask = std::get<1>(rj)->mask;
Rather than referring to get<0> and get<1>, you can immediately name the components:
const auto& [r, fred] = rj;
auto* buffer = fred->data.data();
auto& mask = fred->mask;

MPI error in communicating with many to many processors

I am writing code where each processor must interact with several other processors.
For example: I have 12 processors, and Processor 0 has to communicate with, say, 1, 2, 10 and 9. Let's call them the neighbours of Processor 0. Similarly:
Processor 1 has to communicate with, say, 5 and 3.
Processor 2 has to communicate with 5, 1, 0, 10 and 11,
and so on.
The flow of data is two-way, i.e. Processor 0 must send data to 1, 2, 10 and 9 and also receive data from them.
Also, there is no problem with the tag calculation.
I have written code that works like this:
for(all neighbours)
{
store data in vector<double> x;
MPI_Send(x)
}
MPI_BARRIER();
for(all neighbours)
{
MPI_Recv(x);
do work with x
}
Now I am testing this algorithm for different sizes of x and different arrangements of neighbours. The code works for some, but does not work for others; it simply ends in deadlock.
I have also tried:
for(all neighbours)
{
store data in vector<double> x;
MPI_Isend(x)
}
MPI_Test();
for(all neighbours)
{
MPI_Recv(x);
do work with x
}
The result is the same, although the deadlock is replaced by NaN in the result, as MPI_Test() tells me that some of the MPI_Isend() operations are not complete and execution jumps immediately to MPI_Recv().
Can anyone guide me here: what am I doing wrong? Or is my fundamental approach itself incorrect?
EDIT: I am attaching a code snippet for a better understanding of the problem. I am basically working on parallelizing an unstructured 3D CFD solver.
I have attached one of the files, with some explanation. I am not broadcasting; I am looping over the neighbours of the parent processor to send the data across the interface (this can be defined as a boundary between two zones).
So if, say, I have 12 processors and Processor 0 has to communicate with 1, 2, 10 and 9, then 0 is the parent processor and 1, 2, 10 and 9 are its neighbours.
As the file was too long and is part of the solver, to keep things simple I have only kept the MPI function in it.
void Reader::MPI_InitializeInterface_Values() {
double nbr_interface_id;
Interface *interface;
MPI_Status status;
MPI_Request send_request, recv_request;
int err, flag;
int err2;
char buffer[MPI_MAX_ERROR_STRING];
int len;
int count;
for (int zone_no = 0; zone_no<this->GetNumberOfZones(); zone_no++) { // Number of zone per processor is 1, so basically each zone is an independent processor
UnstructuredGrid *zone = this->ZoneList[zone_no];
int no_of_interface = zone->GetNumberOfInterfaces();
// int count;
long int count_send = 0;
long int count_recv = 0;
long int max_size = 10000; // can be set from test case later
int max_size2 = 199;
int proc_no = FlowSolution::processor_number;
for (int interface_no = 0; interface_no < no_of_interface; interface_no++) { // interface is defined as a boundary between two zones
interface = zone->GetInterface(interface_no);
int no_faces = interface->GetNumberOfFaces();
if (no_faces != 0) {
std::vector< double > Variable_send; // The vector which stores the data to be sent across the interface
std::vector< double > Variable_recieve;
int total_size = FlowSolution::VariableOrder.size() * no_faces;
Variable_send.resize(total_size);
Variable_recieve.resize(total_size);
int nbr_proc_no = zone->GetInterface(interface_no)->GetNeighborZoneId(); // neighbour of parent processor
int j = 0;
nbr_interface_id = interface->GetShared_Interface_ID();
for (std::map<VARIABLE, int>::iterator iterator = FlowSolution::VariableOrder.begin(); iterator != FlowSolution::VariableOrder.end(); iterator++) {
for (int face_no = 0; face_no < no_faces; face_no++) {
Face *face = interface->GetFace(face_no);
int owner_id = face->Getinterface_Original_face_owner_id();
double value_send = zone->GetInterface(interface_no)->GetFace(face_no)->GetCell(owner_id)->GetPresentFlowSolution()->GetVariableValue((*iterator).first);
Variable_send[j] = value_send;
j++;
}
}
count_send = nbr_proc_no * max_size + nbr_interface_id; // tag for data to be sent
err2 = MPI_Isend(&Variable_send.front(), total_size, MPI_DOUBLE, nbr_proc_no, count_send, MPI_COMM_WORLD, &send_request);
}// end of sending
} // all the processors have sent data to their corresponding neighbours
MPI_Barrier(MPI_COMM_WORLD);
for (int interface_no = 0; interface_no < no_of_interface; interface_no++) { // loop over of neighbours of the current processor to receive data
interface = zone->GetInterface(interface_no);
int no_faces = interface->GetNumberOfFaces();
if (no_faces != 0) {
std::vector< double > Variable_recieve; // The vector which collects the data sent across the interface from
int total_size = FlowSolution::VariableOrder.size() * no_faces;
Variable_recieve.resize(total_size);
count_recv = proc_no * max_size + interface_no; // tag to receive data
int nbr_proc_no = zone->GetInterface(interface_no)->GetNeighborZoneId();
nbr_interface_id = interface->GetShared_Interface_ID();
MPI_Irecv(&Variable_recieve.front(), total_size, MPI_DOUBLE, nbr_proc_no, count_recv, MPI_COMM_WORLD, &recv_request);
/* Now some work is done using received data */
int j = 0;
for (std::map<VARIABLE, int>::iterator iterator = FlowSolution::VariableOrder.begin(); iterator != FlowSolution::VariableOrder.end(); iterator++) {
for (int face_no = 0; face_no < no_faces; face_no++) {
double value_recieve = Variable_recieve[j];
j++;
Face *face = interface->GetFace(face_no);
int owner_id = face->Getinterface_Original_face_owner_id();
interface->GetFictitiousCell(face_no)->GetPresentFlowSolution()->SetVariableValue((*iterator).first, value_recieve);
double value1 = face->GetCell(owner_id)->GetPresentFlowSolution()->GetVariableValue((*iterator).first);
double face_value = 0.5 * (value1 + value_recieve);
interface->GetFace(face_no)->GetPresentFlowSolution()->SetVariableValue((*iterator).first, face_value);
}
}
// Variable_recieve.clear();
}
}// end of receiving
} // end of loop over zones
}
Working from the problem statement:
Processor 0 has to send to 1, 2, 9 and 10, and receive from them.
Processor 1 has to send to 5 and 3, and receive from them.
Processor 2 has to send to 0, 1, 5, 10 and 11, and receive from them.
There are 12 total processors.
You can make life easier if you just run a 12-step program:
Step 1: Processor 0 sends, others receive as needed, then the converse occurs.
Step 2: Processor 1 sends, others receive as needed, then the converse occurs.
...
Step 12: Profit - there's nothing left to do (because every other processor has already interacted with Processor 11).
Each step can be implemented as an MPI_Scatterv (some sendcounts will be zero), followed by an MPI_Gatherv. 22 total calls and you're done.
There may be several possible reasons for a deadlock, so you have to be more specific. E.g. the standard says: "When standard send operations are used, then a deadlock situation may occur where both processes are blocked because buffer space is not available."
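To illustrate the point, a hedged sketch of a pairwise exchange (the buffers out/in, the size n and the partner rank are placeholders): the first form relies on MPI buffering the sends and may deadlock for large messages; the second lets MPI pair the two operations itself.

// Unsafe exchange: both ranks may block in MPI_Send if the message cannot be
// buffered, which is exactly the situation the standard describes.
MPI_Send(out.data(), n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
MPI_Recv(in.data(), n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// Safe alternative for a simple pairwise exchange:
MPI_Sendrecv(out.data(), n, MPI_DOUBLE, partner, 0,
             in.data(), n, MPI_DOUBLE, partner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);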
You should use both Isend and Irecv. The general structure should be:
MPI_Request req[2 * n];
MPI_Irecv(..., &req[0]);
// ...
MPI_Irecv(..., &req[n-1]);
MPI_Isend(..., &req[n]);
// ...
MPI_Isend(..., &req[2*n-1]);
MPI_Waitall(2 * n, req, MPI_STATUSES_IGNORE);
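Filled in for the neighbour exchange described in the question, a hedged sketch (the neighbours list, count, tag and the fill_data_for packing helper are illustrative placeholders, not the poster's names):

// One receive and one send per neighbour; all buffers stay alive until MPI_Waitall.
std::vector<std::vector<double>> sendbuf(neighbours.size()), recvbuf(neighbours.size());
std::vector<MPI_Request> req(2 * neighbours.size());

for (std::size_t k = 0; k < neighbours.size(); ++k) {
    recvbuf[k].resize(count);
    MPI_Irecv(recvbuf[k].data(), count, MPI_DOUBLE, neighbours[k],
              tag, MPI_COMM_WORLD, &req[k]);
}
for (std::size_t k = 0; k < neighbours.size(); ++k) {
    sendbuf[k] = fill_data_for(neighbours[k]);   // hypothetical packing helper
    MPI_Isend(sendbuf[k].data(), count, MPI_DOUBLE, neighbours[k],
              tag, MPI_COMM_WORLD, &req[neighbours.size() + k]);
}
MPI_Waitall(static_cast<int>(req.size()), req.data(), MPI_STATUSES_IGNORE);
// Only now is it safe to read recvbuf[k] or reuse sendbuf[k].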
The problem can be solved by using MPI_Allgatherv. All I did was set the counts so that only the processors I wanted to communicate with had non-zero counts; every other processor had a count of 0.
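A hedged sketch of what that call can look like (mycount/mydata are this rank's contribution, size is the communicator size; none of these names are from the original post). The counts are exchanged first so every rank can size the receive buffer:

// Every rank announces how much it contributes (possibly 0)...
std::vector<int> recvcounts(size), displs(size, 0);
MPI_Allgather(&mycount, 1, MPI_INT, recvcounts.data(), 1, MPI_INT, MPI_COMM_WORLD);
for (int i = 1; i < size; ++i) displs[i] = displs[i - 1] + recvcounts[i - 1];

// ...then everyone receives everyone else's block; zero-count ranks add nothing.
std::vector<double> everything(displs[size - 1] + recvcounts[size - 1]);
MPI_Allgatherv(mydata.data(), mycount, MPI_DOUBLE,
               everything.data(), recvcounts.data(), displs.data(), MPI_DOUBLE,
               MPI_COMM_WORLD);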
This solved my problem
Thank you everyone for your answers!

Simple MPI_Gather test with memcpy error

I am learning MPI and trying to create examples of some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the simplest possible code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
#include <mpi.h>
#include <cassert>
#include <cstdlib>
#include <iostream>
using std::cout;

int main()
{
int world_size, world_rank;
int nFits = 100;
double arrCount[100];
double *rBuf = NULL;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
assert(world_size!=1);
int nElements = nFits/(world_size-1);
if(world_rank>0){
for(int k = 0; k < nElements; k++)
{
arrCount[k] = k;
}}
MPI_Barrier(MPI_COMM_WORLD);
if(world_rank==0)
{
rBuf = (double*) malloc( nFits*sizeof(double));
}
MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if(world_rank==0){
for(int i = 0; i < nFits; i++)
{
cout<<rBuf[i]<<"\n";
}}
MPI_Finalize();
exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.
The root process in a gather operation does participate in the operation, i.e. it sends data to its own receive buffer. That also means you must account for its part when allocating the receive buffer.
Now you could use MPI_Gatherv and specify a recvcounts[0] (and a sendcount at the root) of 0 to follow your example closely. But usually you would prefer to write an MPI application such that the root participates equally in the operation, i.e. int nElements = nFits/world_size.
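A hedged sketch of that second suggestion, reusing the names from the question (it assumes world_size divides nFits evenly):

// Every rank, the root included, fills and contributes nElements values.
int nElements = nFits / world_size;
for (int k = 0; k < nElements; k++)
    arrCount[k] = k;

if (world_rank == 0)
    rBuf = (double*) malloc(nFits * sizeof(double));

// The root receives world_size * nElements == nFits doubles, so rBuf is large enough.
MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);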

MPI how to receive dynamic arrays from slave nodes?

I am new to MPI. I want to send three ints to three slave nodes to create dynamic arrays, and each array will be sent back to the master. Following this post, I modified the code, and it's close to the right answer. But I hit a breakpoint when receiving the array from slave #3 (m == 3) in the receiver code. Thank you in advance!
My code is as follows:
#include <mpi.h>
#include <iostream>
#include <stdlib.h>
int main(int argc, char** argv)
{
int firstBreakPt, lateralBreakPt;
//int reMatNum1, reMatNum2;
int tmpN;
int breakPt[3][2]={{3,5},{6,9},{4,7}};
int myid, numprocs;
MPI_Status status;
// double *reMat1;
// double *reMat2;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
tmpN = 15;
if (myid==0)
{
// send three parameters to slaves;
for (int i=1;i<numprocs;i++)
{
MPI_Send(&tmpN,1,MPI_INT,i,0,MPI_COMM_WORLD);
firstBreakPt = breakPt[i-1][0];
lateralBreakPt = breakPt[i-1][1];
//std::cout<<i<<" "<<breakPt[i-1][0] <<" "<<breakPt[i-1][1]<<std::endl;
MPI_Send(&firstBreakPt,1,MPI_INT,i,1,MPI_COMM_WORLD);
MPI_Send(&lateralBreakPt,1,MPI_INT,i,2,MPI_COMM_WORLD);
}
// receive arrays from slaves;
for (int m =1; m<numprocs; m++)
{
MPI_Probe(m, 3, MPI_COMM_WORLD, &status);
int nElems3, nElems4;
MPI_Get_elements(&status, MPI_DOUBLE, &nElems3);
// Allocate buffer of appropriate size
double *result3 = new double[nElems3];
MPI_Recv(result3,nElems3,MPI_DOUBLE,m,3,MPI_COMM_WORLD,&status);
std::cout<<"Tag is 3, ID is "<<m<<std::endl;
for (int ii=0;ii<nElems3;ii++)
{
std::cout<<result3[ii]<<std::endl;
}
MPI_Probe(m, 4, MPI_COMM_WORLD, &status);
MPI_Get_elements(&status, MPI_DOUBLE, &nElems4);
// Allocate buffer of appropriate size
double *result4 = new double[nElems4];
MPI_Recv(result4,nElems4,MPI_DOUBLE,m,4,MPI_COMM_WORLD,&status);
std::cout<<"Tag is 4, ID is "<<m<<std::endl;
for (int ii=0;ii<nElems4;ii++)
{
std::cout<<result4[ii]<<std::endl;
}
}
}
else
{
// receive three paramters from master;
MPI_Recv(&tmpN,1,MPI_INT,0,0,MPI_COMM_WORLD,&status);
MPI_Recv(&firstBreakPt,1,MPI_INT,0,1,MPI_COMM_WORLD,&status);
MPI_Recv(&lateralBreakPt,1,MPI_INT,0,2,MPI_COMM_WORLD,&status);
// width
int width1 = (rand() % (tmpN-firstBreakPt+1))+ firstBreakPt;
int width2 = (rand() % (tmpN-lateralBreakPt+1))+ lateralBreakPt;
// create dynamic arrays
double *reMat1 = new double[width1*width1];
double *reMat2 = new double[width2*width2];
for (int n=0;n<width1; n++)
{
for (int j=0;j<width1; j++)
{
reMat1[n*width1+j]=(double)rand()/RAND_MAX + (double)rand()/(RAND_MAX*RAND_MAX);
//a[i*Width+j]=1.00;
}
}
for (int k=0;k<width2; k++)
{
for (int h=0;h<width2; h++)
{
reMat2[k*width2+h]=(double)rand()/RAND_MAX + (double)rand()/(RAND_MAX*RAND_MAX);
//a[i*Width+j]=1.00;
}
}
// send it back to master
MPI_Send(reMat1,width1*width1,MPI_DOUBLE,0,3,MPI_COMM_WORLD);
MPI_Send(reMat2,width2*width2,MPI_DOUBLE,0,4,MPI_COMM_WORLD);
}
MPI_Finalize();
std::cin.get();
return 0;
}
P.S. This code is the right answer.
Use collective MPI operations, as Zulan suggested. For example, the first thing your code does is have the root send the same value to all the slaves, which is a broadcast, i.e. MPI_Bcast(). Then the root sends a different value to each slave, which is a scatter, i.e. MPI_Scatter().
The last operation is the slave processes sending variably-sized data to the root, for which the MPI_Gatherv() function exists. However, to use this function, you need to:
allocate the incoming buffer on the root (there is no allocation for reMat1 and reMat2 in the first if-branch of your code), therefore the root needs to know their counts,
tell MPI_Gatherv() on the root how many elements will be received from each slave and where to put them.
This problem can easily be solved with a so-called parallel prefix; look at MPI_Scan() or MPI_Exscan().
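A hedged sketch of that pattern for reMat1 (reMat2 would be handled the same way; myid, numprocs, width1 and reMat1 are the question's names, the rest is illustrative and assumes <vector> is included). Here the counts are simply gathered; a scan could instead be used to compute the displacements:

// Each slave reports how many doubles it will send; the master contributes 0.
int mycount = (myid == 0) ? 0 : width1 * width1;
std::vector<int> counts(numprocs), displs(numprocs, 0);
MPI_Gather(&mycount, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

std::vector<double> all;                    // only filled on the root
if (myid == 0) {
    for (int i = 1; i < numprocs; ++i) displs[i] = displs[i - 1] + counts[i - 1];
    all.resize(displs[numprocs - 1] + counts[numprocs - 1]);
}
MPI_Gatherv(myid == 0 ? nullptr : reMat1, mycount, MPI_DOUBLE,
            all.data(), counts.data(), displs.data(), MPI_DOUBLE,
            0, MPI_COMM_WORLD);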
Here you create randomized widths:
int width1 = (rand() % (tmpN-firstBreakPt+1))+ firstBreakPt;
int width2 = (rand() % (tmpN-lateralBreakPt+1))+ lateralBreakPt;
which you later use to send data back to process 0
MPI_Send(reMat1,width1*width1,MPI_DOUBLE,0,3,MPI_COMM_WORLD);
But the original receive expected a different number of elements:
MPI_Recv(reMat1,firstBreakPt*tmpN*firstBreakPt*tmpN,MPI_DOUBLE,m,3,MPI_COMM_WORLD,&status);
which causes problems. Process 0 does not know what sizes each slave process generated, so you have to send those sizes back to it first, the same way you sent the break points to the slaves.
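A hedged sketch of that approach for reMat1 (the tags and variable names are illustrative; the alternative of probing for the message size with MPI_Probe/MPI_Get_elements is what the updated code above already does):

// On the slave: announce the element count first, then send the payload.
int count1 = width1 * width1;
MPI_Send(&count1, 1, MPI_INT, 0, 30, MPI_COMM_WORLD);
MPI_Send(reMat1, count1, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);

// On the master: receive the count from slave m, allocate, then receive the data.
int incoming;
MPI_Recv(&incoming, 1, MPI_INT, m, 30, MPI_COMM_WORLD, &status);
std::vector<double> mat1(incoming);
MPI_Recv(mat1.data(), incoming, MPI_DOUBLE, m, 3, MPI_COMM_WORLD, &status);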

The same random numbers for each process in c++ code with MPI

I have C++ MPI code that works, in that it compiles and does indeed launch on the specified number of processors (n). The problem is that it simply does the same calculation n times, rather than doing one calculation n times faster.
I have hacked quite a few examples I have found on various sites, and it appears I am missing the proper use of MPI_Send and MPI_Recv, but I can't find an instance of these commands that takes a function as input (and I am confused as to why these MPI commands would be useful for anything other than functions).
My code is below. Essentially it calls a C++ function I wrote to get Fisher's Exact Test p-value. The random-number bit is just something I put in to test the speed.
What I want this program to do is farm out Fishers.TwoTailed with each set of random variables (i.e., A, B, C, and D) to a different processor, rather than doing the exact same calculation on multiple processors. Thanks in advance for any insight. Cheers!
Here is the code:
int
main (int argc, char* argv[])
{
int id;
int p;
//
// Initialize MPI.
//
MPI::Init ( argc, argv );
//
// Get the number of processors.
//
p = MPI::COMM_WORLD.Get_size ( );
//
// Get the rank of this processor.
//
id = MPI::COMM_WORLD.Get_rank ( );
FishersExactTest Fishers;
int i = 0;
while (i < 10) {
int A = 0 + rand() % (100 - 0);
int B = 0 + rand() % (100 - 0);
int C = 0 + rand() % (100 - 0);
int D = 0 + rand() % (100 - 0);
cout << Fishers.TwoTailed(A, B, C, D) << endl;
i += 1;
}
MPI::Finalize ( );
return 0;
}
You should look into some basic training about parallel computing and MPI. One good resource that taught me the basics was a free set of online courses put up by the National Center for Supercomputing Applications (NCSA).
You have to tell MPI how to parallelize the code - it won't do it automatically.
In other words, you can't initialize MPI on all the systems and then pass them the same loop. You want to use the id of each processor to determine which part of the loop it will work on. Then you need them to all pass their results back to ID 0.
All of the above answers are perfectly correct. Let me just add a little bit:
Here, since it looks like you're just doing random sampling, all you have to do to get the different processors to generate different random numbers to give to Fishers.TwoTailed is to ensure they all have different seeds to the PRNG:
int
main (int argc, char* argv[])
{
int id;
int p;
//
// Initialize MPI.
//
MPI::Init ( argc, argv );
//
// Get the number of processors.
//
p = MPI::COMM_WORLD.Get_size ( );
//
// Get the rank of this processor.
//
id = MPI::COMM_WORLD.Get_rank ( );
FishersExactTest Fishers;
srand(id); // <--- each rank gets a different seed
int i = 0;
while (i < 10) {
int A = 0 + rand() % (100 - 0);
int B = 0 + rand() % (100 - 0);
int C = 0 + rand() % (100 - 0);
int D = 0 + rand() % (100 - 0);
cout << Fishers.TwoTailed(A, B, C, D) << endl;
i += 1;
}
MPI::Finalize ( );
return 0;
}
Because the loop runs 10 times on every rank, you'll still get each process doing 10 samples. If you want them to do a total of 10, you can divide 10 by p and do something to distribute the remainder, e.g.:
int niters = (10+id)/p;
int i=0;
while (i < niters) {
...
}
Well, what messages do you get when you run your MPI job? Just to reiterate what the others have said, you will have to explicitly define what the job of each processor is, for example: if you are rank 0 (the root by default), do this; if you are rank 1, do that; and so on, with some such logic defining the role of each rank. Then, based on how you structure your code, you can have the nodes Send/Recv, Gather, Scatter, etc.
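Putting those pieces together, here is a hedged sketch of one way to split the 10 evaluations across the ranks and collect the p-values on rank 0. It uses the plain C API (the MPI C++ bindings shown in the question are deprecated), the same (10 + id) / p split suggested above, and a stub in place of the question's FishersExactTest class, which is not shown in the post:

#include <mpi.h>
#include <cstdlib>
#include <iostream>
#include <vector>

// Stand-in for the question's FishersExactTest::TwoTailed; replace with the real class.
static double two_tailed_stub(int A, int B, int C, int D) {
    return static_cast<double>(A + B + C + D);   // placeholder value only
}

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int id, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    srand(id + 1);                       // a different seed per rank
    const int total = 10;
    int mine = (total + id) / p;         // this rank's share of the 10 evaluations

    std::vector<double> local(mine);
    for (int i = 0; i < mine; ++i) {
        int A = rand() % 100, B = rand() % 100, C = rand() % 100, D = rand() % 100;
        local[i] = two_tailed_stub(A, B, C, D);
    }

    // Rank 0 collects everyone's variable-sized piece.
    std::vector<int> counts(p), displs(p, 0);
    MPI_Gather(&mine, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);
    std::vector<double> all;
    if (id == 0) {
        for (int i = 1; i < p; ++i) displs[i] = displs[i - 1] + counts[i - 1];
        all.resize(total);
    }
    MPI_Gatherv(local.data(), mine, MPI_DOUBLE,
                all.data(), counts.data(), displs.data(), MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    if (id == 0)
        for (double x : all) std::cout << x << "\n";

    MPI_Finalize();
    return 0;
}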