Simple MPI_Gather test with memcpy error - c++

I am learning MPI, and trying to create examples of some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the most simple code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
int main()
{
int world_size, world_rank;
int nFits = 100;
double arrCount[100];
double *rBuf = NULL;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
assert(world_size!=1);
int nElements = nFits/(world_size-1);
if(world_rank>0){
for(int k = 0; k < nElements; k++)
{
arrCount[k] = k;
}}
MPI_Barrier(MPI_COMM_WORLD);
if(world_rank==0)
{
rBuf = (double*) malloc( nFits*sizeof(double));
}
MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if(world_rank==0){
for(int i = 0; i < nFits; i++)
{
cout<<rBuf[i]<<"\n";
}}
MPI_Finalize();
exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.

The root process in a gather operation does participate in the operation. I.e. it sends data to it's own receive buffer. That also means you must allocate memory for it's part in the receive buffer.
Now you could use MPI_Gatherv and specify a recvcounts[0]/sendcount at root of 0 to follow your example closely. But usually you would prefer to write an MPI application in a way that the root participates equally in the operation, i.e. int nElements = nFits/world_size.

Related

MPI RMA from multiple threads

In my application I am implementing what I would call a "reduce on sparse vectors" via a TBB flow graph compared with MPI one-sided communication (RMA). The central piece of the algorithm looks as follows:
auto &reduce = m_g_R.add<function_node<ReductionJob, ReductionJob>>(
serial,
[=, &reduced_bi](ReductionJob rj) noexcept
{
const auto r = std::get<0>(rj);
auto *buffer = std::get<1>(rj)->data.data();
auto &mask = std::get<1>(rj)->mask;
if (m_R_comms[r] != MPI_COMM_NULL)
{
const size_t n = reduced_bi.dim(r);
MPI_Win win;
MPI_Win_create(
buffer,
r == mr ? n * sizeof(T) : 0,
sizeof(T),
MPI_INFO_NULL,
m_R_comms[r],
&win
);
if (n > 0 && r != mr)
{
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
size_t i = 0;
do
{
while (i < n && !mask[i]) ++i;
size_t base = i;
while (i < n && mask[i]) ++i;
if (i > base) MPI_Accumulate(
buffer + base, i - base, MpiType<T>::type,
0,
base, i - base, MpiType<T>::type,
MPI_SUM,
win
);
}
while (i < n);
MPI_Win_unlock(0, win);
}
MPI_Win_free(&win);
}
return rj;
}
);
This is executed for each rank r participating in the calculation, with reduced_bi.dim(r) specifying how many elements each rank owns. mr is the current rank, and the communicators are created in such a way that the target process is root for each of them. buffer is an array of T = double (typically), and mask is an std::vector<bool> identifying which elements are non-zero. The combination of loops splits the communication into chunks of non-zero elements.
This generally works fine and results are correct, same as my previous implementation using MPI_Reduce. However, is seems to be crucial that the concurrency level for this node is set to serial, indicating that there is at most one parallel TBB task (and thus at most one thread) executing this code.
I would like to set it to unlimited to improve performance, and indeed that works fine that way on my laptop with small jobs, running with MPICH 3.4.1. On the cluster where I really want to run the computation, however, using OpenMPI 4.1.1, it runs for a while before crashing with a segfault and a backtrace involving a bunch of UCX functions.
I wonder now, is it not allowed to have multiple threads in parallel call RMA operations like this (and on my laptop it only works accidentally), or am I hitting a bug/limitation on the cluster? From the documentation I do not see directly that what I would like to do is not supported.
Of course, MPI is initialized with MPI_THREAD_MULTIPLE and I repeat again that the snippet as posted above works fine, only when I change serial --> unlimited to enable concurrent execution do I hit the problem on the cluster.
In reply to Victor Eijkhout comment(s) below, here is a complete sample program that reproduces the issue. This runs fine on my laptop (tested specifically with mpirun -n 16), but it crashes on the cluster when I run it with 16 ranks (spread across 4 cluster nodes).
#include <iostream>
#include <vector>
#include <thread>
#include <mpi.h>
int main(void)
{
int requested = MPI_THREAD_MULTIPLE, provided;
MPI_Init_thread(nullptr, nullptr, requested, &provided);
if (provided != requested)
{
std::cerr << "Failed to initialize MPI with full thread support!"
<< std::endl;
exit(1);
}
int mr, nr;
MPI_Comm_rank(MPI_COMM_WORLD, &mr);
MPI_Comm_size(MPI_COMM_WORLD, &nr);
const size_t dim = 1024;
const size_t repeat = 100;
std::vector<double> send(dim, static_cast<double>(mr) + 1.0);
std::vector<double> recv(dim, 0.0);
MPI_Win win;
MPI_Win_create(
recv.data(),
recv.size() * sizeof(double),
sizeof(double),
MPI_INFO_NULL,
MPI_COMM_WORLD,
&win
);
std::vector<std::thread> threads;
for (size_t i = 0; i < repeat; ++i)
{
threads.clear();
threads.reserve(nr);
for (int r = 0; r < nr; ++r) if (r != mr)
{
threads.emplace_back([r, &send, &win]
{
MPI_Win_lock(MPI_LOCK_SHARED, r, 0, win);
for (size_t i = 0; i < dim; ++i) MPI_Accumulate(
send.data() + i, 1, MPI_DOUBLE,
r,
i, 1, MPI_DOUBLE,
MPI_SUM,
win
);
MPI_Win_unlock(r, win);
});
}
for (auto &t : threads) t.join();
MPI_Barrier(MPI_COMM_WORLD);
if (mr == 0) std::cout << recv.front() << std::endl;
}
MPI_Win_free(&win);
MPI_Finalize();
}
Note: I am intentionally using plain threads here to avoid unnecessary dependencies. It should be linked with -lpthread.
The specific error I get on the cluster is this, using OpenMPI 4.1.1:
*** An error occurred in MPI_Accumulate
*** reported by process [1829189442,11]
*** on win ucx window 3
*** MPI_ERR_RMA_SYNC: error executing rma sync
*** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
*** and potentially your MPI job)
Possible relevant parts from ompi_info:
Open MPI: 4.1.1
Open MPI repo revision: v4.1.1
Open MPI release date: Apr 24, 2021
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes)
It has been compiled with UCX/1.10.1.
The style in C++ is to put the * or & with the type, not the identifier. This is called out specifically near the beginning of Stroustrup’s first book, and is an intentional difference from C style.
Create -- Lock -- Unlock -- Free
⧺R.1 Manage resources automatically using resource handles and RAII (Resource Acquisition Is Initialization)
Use a wrapper class, either written for the purpose, designed for this C API, a general purpose resource manager template, or unique_ptr with a custom deleter, rather than explicit calls that must be matched up for correct behavior.
RAII/RFID is one of C++'s fundamental strengths, and using it will go a long way to making your code less buggy and more maintainable in time.
Use destructuring syntax.
const auto r = std::get<0>(rj);
auto *buffer = std::get<1>(rj)->data.data();
auto &mask = std::get<1>(rj)->mask;
rather than referring to get<0> and get<1> you can immediately name the components.
const auto& [r, fred] = rj;
auto* buffer = fred->data.data();
auto& mask = fred->mask;

Time efficient design model for sending to and receiving from all mpi processes: MPI all 2 all communication

I am trying to send message to all MPI processes from a process and also receive message from all those processes in a process. It is basically an all to all communication where every process sends message to every other process (except itself) and receives message from every other process.
The following example code snippet shows what I am trying to achieve. Now, the problem with MPI_Send is its behavior where for small message size it acts as non-blocking but for the larger message (in my machine BUFFER_SIZE 16400) it blocks. I am aware of this is how MPI_Send behaves. As a workaround, I replaced the code below with blocking (send+recv) which is MPI_Sendrecv. Example code is like this MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUSES_IGNORE) . I am making the above call for all the processes of MPI_COMM_WORLD inside a loop for every rank and this approach gives me what I am trying to achieve (all to all communication). However, this call takes a lot of time which I want to cut-down with some time-efficient approach. I have tried with mpi scatter and gather to perform all to all communication but here one issue is the buffer size (16400) may differ in actual implementation in different iteration for MPI_all_to_all function calling. Here, I am using MPI_TAG to differentiate the call in different iteration which I cannot use in scatter and gather functions.
#define BUFFER_SIZE 16400
void MPI_all_to_all(int MPI_TAG)
{
int size;
int rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int* intSendPack = new int[BUFFER_SIZE]();
int* intReceivePack = new int[BUFFER_SIZE]();
for (int prId = 0; prId < size; prId++) {
if (prId != rank) {
MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
MPI_COMM_WORLD);
}
}
for (int sId = 0; sId < size; sId++) {
if (sId != rank) {
MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
}
}
}
I want to know if there is a way I can perform all to all communication using any efficient communication model. I am not sticking to MPI_Send, if there is some other way which provides me what I am trying to achieve, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that allows to compare performance of collective vs. point-to-point communication in an all-to-all communication,
#include <iostream>
#include <algorithm>
#include <mpi.h>
#define BUFFER_SIZE 16384
void point2point(int*, int*, int, int);
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int rank_id = 0, com_sz = 0;
double t0 = 0.0, tf = 0.0;
MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
int* intSendPack = new int[BUFFER_SIZE]();
int* result = new int[BUFFER_SIZE*com_sz]();
std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
// Send-Receive
t0 = MPI_Wtime();
point2point(intSendPack, result, rank_id, com_sz);
MPI_Barrier(MPI_COMM_WORLD);
tf = MPI_Wtime();
if (!rank_id)
std::cout << "Send-receive time: " << tf - t0 << std::endl;
// Collective
std::fill(result, result + BUFFER_SIZE*com_sz, 0);
std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
t0 = MPI_Wtime();
MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
tf = MPI_Wtime();
if (!rank_id)
std::cout << "Allgather time: " << tf - t0 << std::endl;
MPI_Finalize();
delete[] intSendPack;
delete[] result;
return 0;
}
// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
MPI_Status status;
// Exchange and store the data
for (int i=0; i<com_sz; i++){
if (i != rank_id){
MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
}
}
}
Here every rank contributes its own array intSendPack to the array result on all other ranks that should end up the same on all the ranks. result is flat, each rank takes BUFFER_SIZE entries starting with its rank_id*BUFFER_SIZE. After the point-to-point communication, the array is reset to its original shape.
Time is measured by setting up an MPI_Barrier which will give you the maximum time out of all ranks.
I ran the benchmark on 1 node of Nersc Cori KNL using slurm. I ran it a few times each case just to make sure the values are consistent and I'm not looking at an outlier, but you should run it maybe 10 or so times to collect more proper statistics.
Here are some thoughts:
For small number of processes (5) and a large buffer size (16384) collective communication is about twice faster than point-to-point, but it becomes about 4-5 times faster when moving to larger number of ranks (64).
In this benchmark there is not much difference between performance with recommended slurm settings on that specific machine and default settings but in real, larger programs with more communication there is a very significant one (job that runs for less than a minute with recommended will run for 20-30 min and more with default). Point of this is check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed those, there are two worth it SO posts on it: buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in an all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms used for communication arrangement. A 2-5 times speedup is considerable, since communication often contributes to the overall time the most.

Why does MPI_Recv fail when there is an accumulation of MPI_Send calls

I have an MPI program in which worker ranks (rank != 0) make a bunch of MPI_Send calls, and the master rank (rank == 0) receives all these messages. However, I run into a Fatal error in MPI_Recv - MPI_Recv(...) failed, Out of memory.
Here is the code that I am compiling in Visual Studio 2010.
I run the executable like so:
mpiexec -n 3 MPIHelloWorld.exe
int main(int argc, char* argv[]){
int numprocs, rank, namelen, num_threads, thread_id;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);
if(rank == 0){
for(int k=1; k<numprocs; k++){
for(int i=0; i<1000000; i++){
double x;
MPI_Recv(&x, 1, MPI_DOUBLE, k, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
}
}
else{
for(int i=0; i<1000000; i++){
double x = 5;
MPI_Send(&x, 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
}
}
}
If I run with only 2 processes, the program does not crash. So it seems like the problem is when there is an accumulation of the MPI_Send calls from a third rank (aka a second worker node).
If I decrease the number of iterations to 100,000 then I can run with 3 processes without crashing. However, the amount of data being sent with one million iterations is ~ 8 MB (8 bytes for double * 1000000 iterations), so I don't think the "Out of Memory" is referring to any physical memory like RAM.
Any insight is appreciated, thanks!
The MPI_send operation stores the data on the system buffer ready to send. The size of this buffer and where it is stored is implementation specific (I remember hearing that this can even be in the interconnects). In my case (linux with mpich) I don't get a memory error. One way to explicitly change this buffer is to use MPI_buffer_attach with MPI_Bsend. There may also be a way to change the system buffer size (e.g. MP_BUFFER_MEM system variable on IBM systems).
However that this situation of unrequited messages should probably not occur in practice. In your example above, the order of the k and i loops could be swapped to prevent this build up of messages.

Share memory across MPI nodes to prevent unecessary copying

I have an algorithm where in each iteration each node has to calculate a segment of an array, where each element of x_ depends on all the elements of x.
x_[i] = some_func(x) // each x_[i] depends on the entire x
That is, each iteration takes x and calculates x_, which will be the new x for the next iteration.
A way of paralelizing this is MPI would be to split x_ between the nodes and have an Allgather call after the calculation of x_ so that each processor would send its x_ to the appropriate location in x in all the other processors, then repeat. This is very inefficient since it requires an expensive Allgather call every iteration, not to mention it requires as many copies of x as there are nodes.
I've thought of an alternative way that doesn't require copying. If the program is running on a single machine, with shared RAM, would it be possible to just share the x_ between the nodes (without copying)? That is, after calculating x_ each processor would make it visible to the other nodes, which could then use it as their x for the next iteration without needing to make several copies. I can design the algorithm so that no processor accesses the same x_ at the same time, which is why making a private copy for each node is overkill.
I guess what I'm asking is: can I share memory in MPI simply by tagging an array as shared-between-nodes, as opposed to manually making a copy for each node? (for simplicity assume I'm running on one CPU)
You can share memory within a node using MPI_Win_allocate_shared from MPI-3. It provides a portable way to use Sys5 and POSIX shared memory (and anything similar).
MPI functions
The following are taken from the MPI 3.1 standard.
Allocating shared memory
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
IN disp_unit local unit size for displacements, in bytes (positive integer)
IN info info argument (handle) IN comm intra-communicator (handle)
OUT baseptr address of local allocated window segment (choice)
OUT win window object returned by the call (handle)
int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, void *baseptr, MPI_Win *win)
(if you want the Fortran declaration, click the link)
You deallocate memory using MPI_Win_free. Both allocation and deallocation are collective. This is unlike Sys5 or POSIX, but makes the interface much simpler on the user.
Querying the node allocations
In order to know how to perform load-store against another process' memory, you need to query the address of that memory in the local address space. Sharing the address in the other process' address space is incorrect (it might work in some cases, but one cannot assume it will work).
MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr)
IN win shared memory window object (handle)
IN rank rank in the group of window win (non-negative integer) or MPI_PROC_NULL
OUT size size of the window segment (non-negative integer)
OUT disp_unit local unit size for displacements, in bytes (positive integer)
OUT baseptr address for load/store access to window segment (choice)
int MPI_Win_shared_query(MPI_Win win, int rank, MPI_Aint *size, int *disp_unit, void *baseptr)
(if you want the Fortran declaration, click the link above)
Synchronizing shared memory
MPI_WIN_SYNC(win)
IN win window object (handle)
int MPI_Win_sync(MPI_Win win)
This function serves as a memory barrier for load-store accesses to the data associated with the shared memory window.
You can also use ISO language features (i.e. those provided by C11 and C++11 atomics) or compiler extensions (e.g. GCC intrinsics such as __sync_synchronize) to attain a consistent view of data.
Synchronization
If you understand interprocess shares memory semantics already, the MPI-3 implementation will be easy to understand. If not, just remember that you need to synchronize memory and control flow correctly. There is MPI_Win_sync for the former, while existing MPI sync functions like MPI_Barrier and MPI_Send+MPI_Recv will work for the latter. Or you can use MPI-3 atomics to build counters and locks.
Example program
The following code is from https://github.com/jeffhammond/HPCInfo/tree/master/mpi/rma/shared-memory-windows, which contains example programs of shared-memory usage that have been used by the MPI Forum to debate the semantics of these features.
This program demonstrates unidirectional pair-wise synchronization through shared-memory. If you merely want to create a WORM (write-once, read-many) slab, that should be much simpler.
#include <stdio.h>
#include <mpi.h>
/* This function synchronizes process rank i with process rank j
* in such a way that this function returns on process rank j
* only after it has been called on process rank i.
*
* No additional semantic guarantees are provided.
*
* The process ranks are with respect to the input communicator (comm). */
int p2p_xsync(int i, int j, MPI_Comm comm)
{
/* Avoid deadlock. */
if (i==j) {
return MPI_SUCCESS;
}
int rank;
MPI_Comm_rank(comm, &rank);
int tag = 666; /* The number of the beast. */
if (rank==i) {
MPI_Send(NULL, 0, MPI_INT, j, tag, comm);
} else if (rank==j) {
MPI_Recv(NULL, 0, MPI_INT, i, tag, comm, MPI_STATUS_IGNORE);
}
return MPI_SUCCESS;
}
/* If val is the same at all MPI processes in comm,
* this function returns 1, else 0. */
int coll_check_equal(int val, MPI_Comm comm)
{
int minmax[2] = {-val,val};
MPI_Allreduce(MPI_IN_PLACE, minmax, 2, MPI_INT, MPI_MAX, comm);
return ((-minmax[0])==minmax[1] ? 1 : 0);
}
int main(int argc, char * argv[])
{
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int * shptr = NULL;
MPI_Win shwin;
MPI_Win_allocate_shared(rank==0 ? sizeof(int) : 0,sizeof(int),
MPI_INFO_NULL, MPI_COMM_WORLD,
&shptr, &shwin);
/* l=local r=remote */
MPI_Aint rsize = 0;
int rdisp;
int * rptr = NULL;
int lint = -999;
MPI_Win_shared_query(shwin, 0, &rsize, &rdisp, &rptr);
if (rptr==NULL || rsize!=sizeof(int)) {
printf("rptr=%p rsize=%zu \n", rptr, (size_t)rsize);
MPI_Abort(MPI_COMM_WORLD, 1);
}
/*******************************************************/
MPI_Win_lock_all(0 /* assertion */, shwin);
if (rank==0) {
*shptr = 42; /* Answer to the Ultimate Question of Life, The Universe, and Everything. */
MPI_Win_sync(shwin);
}
for (int j=1; j<size; j++) {
p2p_xsync(0, j, MPI_COMM_WORLD);
}
if (rank!=0) {
MPI_Win_sync(shwin);
}
lint = *rptr;
MPI_Win_unlock_all(shwin);
/*******************************************************/
if (1==coll_check_equal(lint,MPI_COMM_WORLD)) {
if (rank==0) {
printf("SUCCESS!\n");
}
} else {
printf("rank %d: lint = %d \n", rank, lint);
}
MPI_Win_free(&shwin);
MPI_Finalize();
return 0;
}

MPI virtual topology design

I have been trying to create a star topology using MPI_Comm_split but I seem to have and issue when I try to establish the links withing all processes. The processes are allo expected to link to p0 of MPI_COMM_WORLD . Problem is I get a crash in the line
error=MPI_Intercomm_create( MPI_COMM_WORLD, 0, NEW_COMM, 0 ,create_tag, &INTERCOMM );
The error is : MPI_ERR_COMM: invalid communicator .
I have and idea of the cause though I don't know how to fix it . It seems this is due to a call by process zero which doesn't belong to the new communicator(NEW_COMM) . I have tried to put an if statement to stop execution of this line if process = 0, but this again fails since its a collective call.
Any suggestions would be appreciated .
#include <iostream>
#include "mpi.h"
using namespace std;
int main(){
MPI_Comm NEW_COMM , INTERCOMM;
MPI_Init(NULL,NULL);
int world_rank , world_size,new_size, error;
error = MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
error = MPI_Comm_size(MPI_COMM_WORLD,&world_size);
int color = MPI_UNDEFINED;
if ( world_rank > 0 )
color = world_rank ;
error = MPI_Comm_split(MPI_COMM_WORLD, color , world_rank, &NEW_COMM);
int new_rank;
if ( world_rank > 0 ) {
error = MPI_Comm_rank( NEW_COMM , &new_rank);
error = MPI_Comm_size(NEW_COMM, &new_size);
}
int create_tag = 99;
error=MPI_Intercomm_create( MPI_COMM_WORLD, 0, NEW_COMM, 0 ,create_tag, &INTERCOMM );
if ( world_rank > 0 )
cout<<" My Rank in WORLD = "<< world_rank <<" New rank = "<<new_rank << " size of NEWCOMM = "<<new_size <<endl;
else
cout<<" Am centre "<<endl;
MPI_Finalize();
return 0;
}
What about using a MPI topology rather? Something like this:
#include <mpi.h>
#include <iostream>
int main( int argc, char *argv[] ) {
MPI_Init( &argc, &argv );
int rank, size;
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
int indegree, outdegree, *sources, *sourceweights, *destinations, *destweights;
if ( rank == 0 ) { //centre of the star
indegree = outdegree = size - 1;
sources = new int[size - 1];
sourceweights = new int[size - 1];
destinations = new int[size - 1];
destweights = new int[size - 1];
for ( int i = 0; i < size - 1; i++ ) {
sources[i] = destinations[i] = i + 1;
sourceweights[i] = destweights[i] = 1;
}
}
else { // tips of the star
indegree = outdegree = 1;
sources = new int[1];
sourceweights = new int[1];
destinations = new int[1];
destweights = new int[1];
sources[0] = destinations[0] = 0;
sourceweights[0] = destweights[0] = 1;
}
MPI_Comm star;
MPI_Dist_graph_create_adjacent( MPI_COMM_WORLD, indegree, sources, sourceweights,
outdegree, destinations, destweights, MPI_INFO_NULL,
true, &star );
delete[] sources;
delete[] sourceweights;
delete[] destinations;
delete[] destweights;
int starrank;
MPI_Comm_rank( star, &starrank );
std::cout << "Process #" << rank << " of MPI_COMM_WORLD is process #" << starrank << " of the star\n";
MPI_Comm_free( &star);
MPI_Finalize();
return 0;
}
Is that the sort of thing you were after? If not, what is your communicator for?
EDIT: Explanation about MPI topologies
I wanted to clarify that, even if this graph communicator is presented as such, it is no different to MPI_COMM_WORLD in most aspects. Notably, it comprises the whole set of MPI processes initially present in MPI_COMM_WORLD. Indeed, although its star shape has been defined and we didn't represent any link between process #1 and process #2 for example, nothing prevents you from making a point to point communication between these two processes. Simply, by defining this graph topology, you give an indication of the sort of communication pattern your code will expose. Then you ask the library to try to reorder the ranks on the physical nodes, for coming up with a possibly better match between the physical layout of your machine / network, and the needs you express. This can be done internally by an algorithm minimising a cost function using a simulated annealing method for example, but this is costly. Moreover, this supposes that the actual layout of the network is available somewhere for the library to use (which isn't the case most of the time). So at the end of the day, most of the time, this placement optimisation phase is just ignored and you end-up with the same indexes as the ones you entered... Only do I know about some meshed / torus shaped network-based machines to actually perform the placement phase for MPI_Car_create(), but maybe I'm outdated on that.
Anyway, the bottom line is that I understand you want to play with communicators for learning, but don't expect too much out of them. The best thing to learn here is how to get the ones you want in the least and simplest possible calls, which is I hope what I proposed.