Can someone explain and tell me more about MPI_Comm_split communicator?
MPI_Comm_split(MPI_COMM_WORLD, my_row, my_rank,&my_row_comm);
This is just example i met by reading some basic documentations. Maybe someone could tell me how this communicator is working?
Just to begin with, let's have a look at the man page:
MPI_Comm_split(3) MPI MPI_Comm_split(3)
NAME
MPI_Comm_split - Creates new communicators based on colors and keys
SYNOPSIS
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
INPUT PARAMETERS
comm - communicator (handle)
color - control of subset assignment (nonnegative integer). Processes
with the same color are in the same new communicator
key - control of rank assignment (integer)
OUTPUT PARAMETERS
newcomm
- new communicator (handle)
So what does that do?
Well, as the name suggests, it splits the communicator comm into disjoint sub-communicators. Each process of comm ends up in exactly one of these sub-communicators, which is why the output newcomm is a single communicator (the one the current process belongs to). Globally speaking, however, you have to understand that the various versions of newcomm are different sub-communicators, partitioning the input comm.
So that is what the function does. But how does it do it?
Well, that's where the two parameters color and key come into play:
color is an integer value that decides which of the sub-communicators the current process will fall into. More specifically, all processes of comm that pass the same numerical value for color will be part of the same sub-communicator newcomm. For example, if you define color = rank%2; (with rank the rank of the process in comm), then you create (globally) two new communicators: one for the processes of odd rank, and one for the processes of even rank. Keep in mind, however, that each process only sees the one of these new communicators it belongs to... In summary, color tells apart the various "teams" you create, much like the colour of the jerseys football teams wear to distinguish themselves during a match (hence the name, I presume).
key optionally decides how the processes are ranked inside the new communicator they belong to. For example, if you set key = rank;, then the order of the ranks (not the rank values themselves) in each new communicator newcomm will follow the order of the ranks in the original communicator comm. But if you don't care about the ordering, you can just as well set key = 0; and the ranking in each of the new communicators will be whatever the library decides...
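To make the rank%2 example concrete, here is a minimal self-contained sketch (the printed labels are just illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Processes passing the same color (rank % 2) end up in the same
       sub-communicator; key = rank keeps the original ordering. */
    MPI_Comm newcomm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);

    int newrank, newsize;
    MPI_Comm_rank(newcomm, &newrank);
    MPI_Comm_size(newcomm, &newsize);
    printf("world rank %d -> color %d, new rank %d of %d\n",
           rank, rank % 2, newrank, newsize);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}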
Finally, two trivial examples:
MPI_Comm_split(comm, 0, rank, &newcomm) will just duplicate comm into newcomm (just as MPI_Comm_dup())
MPI_Comm_split(comm, rank, rank, &newcomm) will just return an equivalent of MPI_COMM_SELF for each of the processes
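For the first of these, you can convince yourself with MPI_Comm_compare, which should report MPI_CONGRUENT (same group and same ordering, but a different communication context) between the original communicator and the split result. A quick sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* color = 0 for everyone, key = rank: same group, same order. */
    MPI_Comm dup_like;
    MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &dup_like);

    int result;
    MPI_Comm_compare(MPI_COMM_WORLD, dup_like, &result);
    if (rank == 0)
        printf("congruent? %s\n", result == MPI_CONGRUENT ? "yes" : "no");

    MPI_Comm_free(&dup_like);
    MPI_Finalize();
    return 0;
}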
In a distributed Erlang system pids can have two different representations: i) internal; ii) external.
The internal representation has the following shape: < A.B.C >
The external representation, used for instance when a message has to travel across different nodes, is instead composed of the following elements: < node_id, ID, serial, creation > according to the official documentation.
Here, node_id is the name of the node, ID and serial identify the process on node_id, and creation is an integer used to distinguish the node from past (crashed) incarnations of itself.
What I could not find is how the creation integer is created by the VM.
By setting up a small experiment on my PC, I have seen that if I create and kill the same node several times, the counter is always increased by 1. By creating the same node on different machines, the creation integers are different but have some structural similarities, for instance:
machine 1 -> creation integer = 1647595383
machine 2 -> creation integer = 1647596018
Do any of you have any knowledge about how this integer is created? If so could you please explain it to me and possibly reference some (more or less) official documentation?
The creation is sent as part of the response to node registration in epmd; see the details of that protocol.
If you have a custom erl_epmd module, you can also provide your own way of creating the creation-value.
The original creation is the local time of when the node with that name is first registered, and then it is bumped once for each time the name is re-registered.
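As a quick sanity check of the "registration time" explanation: if you assume the reported creation integers are plain Unix timestamps (my assumption, not something the documentation states), they decode to plausible dates, which would also explain why the two machines produced similar-looking values. A throwaway C snippet to check:

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* value observed on machine 1 in the question */
    time_t creation = 1647595383;
    char buf[64];
    strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", gmtime(&creation));
    printf("%s UTC\n", buf);   /* prints a date in March 2022 */
    return 0;
}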
Suppose that I create custom MPI_Datatypes for subarrays of different sizes on each of the MPI processes allocated to a program. Now I wish to send these subarrays to the master process and assemble them into a bigger array block by block. The master process is unaware of the individual datatypes (defined by the local sizes) on the other processes. Naively, therefore, I might attempt to send over these custom datatypes to the master process in the following manner.
MPI_Datatype localarr_type;
MPI_Type_create_subarray( NDIMS, array_size, local_size, box_corner, MPI_ORDER_C, MPI_FLOAT, &localarr_type );
MPI_Type_commit(&localarr_type);

if (rank == master)
{
    for (int id = 1; id < nprocs; ++id)
    {
        MPI_Recv( &localarr_type, 1, MPI_Datatype, id, tag1[id], comm_cart, MPI_STATUS_IGNORE );
        MPI_Recv( big_array, 1, localarr_type, id, tag2[id], comm_cart, MPI_STATUS_IGNORE );
    }
}
else
{
    MPI_Send( &localarr_type, 1, MPI_Datatype, master, tag1[rank], comm_cart );
    MPI_Send( local_box, 1, localarr_type, master, tag2[rank], comm_cart );
}
However, this does not compile. The GNU and Clang compilers produce the first error message below, and the Intel compiler the second.
/* GNU OR CLANG COMPILER */
error: unexpected type name 'MPI_Datatype': expected expression
/* INTEL COMPILER */
error: type name is not allowed
This means that either (1) I am attempting to send a custom MPI_Datatype over to a different process in the wrong way or that (2) this is not possible at all. I would like to know which it is, and if it is (1), I would like to know what the correct way of communicating a custom MPI_Datatype is. Thank you.
Note.
I am aware of other ways of solving the above problem without needing to communicate MPI_Datatypes. For example, one could communicate the local array sizes and manually reconstruct the MPI_Datatype from other processes inside the master process before using it in the subsequent communication of subarrays. This is not what I am looking for.
I wish to communicate the custom MPI_Datatype itself (as shown in the example above), not something that is an instance of the datatype (which is doable, as also shown in the example code above).
First of all: you cannot send a datatype like that. MPI_Datatype is a type name, not a value of type MPI_Datatype. (It's a cute idea, though.) You could send the parameters with which the type is constructed, and then reconstruct it on the receiving side.
However, you are probably misunderstanding the nature of MPI. In your code, with the same datatype on the workers and the manager, you are effectively assuming that everyone has data of the same size/shape. That is not compatible with the manager gathering everything together.
If you're gathering data on a manager process (usually not a good idea: are you really sure you need that?), then the contributing processes have their data in a small array, say at indices 0..99, so they can send it as an ordinary contiguous buffer. The "manager" has a much larger array and places all the contributions in disjoint locations. So at most the manager needs to create subarray types to indicate where the received data goes in the big array.
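A minimal sketch of that pattern, assuming a 1-D decomposition and illustrative names (local_box, local_n, big_array are mine, not from the question): each worker first sends its size, then its data as a contiguous buffer, and the manager alone decides where each block lands.

#include <mpi.h>

void gather_blocks(int rank, int nprocs, MPI_Comm comm,
                   float *local_box, int local_n,   /* worker data           */
                   float *big_array)                /* significant on rank 0 */
{
    if (rank == 0) {
        int offset = 0;
        for (int id = 1; id < nprocs; ++id) {
            int n;
            /* 1) receive the size, i.e. the "construction parameters" */
            MPI_Recv(&n, 1, MPI_INT, id, 0, comm, MPI_STATUS_IGNORE);
            /* 2) receive the contiguous block straight into place; for a
                  multi-dimensional layout the manager would build its own
                  subarray type here instead of using a plain offset. */
            MPI_Recv(big_array + offset, n, MPI_FLOAT, id, 1, comm,
                     MPI_STATUS_IGNORE);
            offset += n;
        }
    } else {
        MPI_Send(&local_n, 1, MPI_INT, 0, 0, comm);
        MPI_Send(local_box, local_n, MPI_FLOAT, 0, 1, comm);
    }
}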
I have a list of indices whose corresponding entries in a vector I do not know, because the vector is distributed among the ranks. I have to send these indices to the ranks in charge of them to get the data.
On the other hand, "my" rank also gets lists of indices from an unknown number of ranks. After receiving such a list, "my" rank has to send the corresponding data to the requesting ranks.
I think I have to work with a mixture of MPI_Probe and MPI_Gather. But at the moment I cannot see how to receive lists from an unknown number of ranks.
I think it has to look like this, but how can I receive the data from a larger, unknown number of ranks? Or do I have to loop over all possible ranks that could send me something?
MPI_Status status;
int count;
std::vector<Size> indices;

MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
MPI_Get_count(&status, MPI_UINT64_T, &count);   // number of elements, not bytes
if (count != MPI_UNDEFINED) {
    indices.resize(count);                      // resize (not reserve) so the storage is valid
    MPI_Recv(indices.data(), count, MPI_UINT64_T,
             status.MPI_SOURCE, status.MPI_TAG, comm, &status);
}
This closely resembles what I did a few years ago for parallel I/O.
One option:
From all senders, determine the size that has to be sent to each other rank
Send the sizes (Allgather if all ranks can be senders, otherwise sends/receives)
Do a (all)gatherv that will retrieve the sizes on each receiver
You can use non-blocking sends/receives as well as gatherv (MPI-3), and this scales well (depending on the hardware) to 500 cores for 8 senders.
The way we did it was to go through the vector in chunks of several MB and send the data chunk by chunk. Of course, the bigger the chunks the better, but the more memory you also need on each sender rank to hold the data.
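A minimal sketch of the size-exchange idea for the question's index lists, assuming every rank may request from every other rank. The use of MPI_Alltoall for the counts followed by MPI_Alltoallv for the lists, and names like sendcounts/sdispls, are my choices for the sketch, not necessarily the exact calls described above:

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* sendcounts[r] = how many indices this rank requests from rank r,
   sendbuf/sdispls = the requested indices, grouped by destination rank. */
void exchange_indices(MPI_Comm comm, int nprocs,
                      int *sendcounts, uint64_t *sendbuf, int *sdispls)
{
    int *recvcounts = malloc(nprocs * sizeof *recvcounts);
    int *rdispls    = malloc(nprocs * sizeof *rdispls);

    /* Every rank learns how much it will receive from every other rank,
       so the "unknown number of senders" becomes known. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    int total = 0;
    for (int r = 0; r < nprocs; ++r) {
        rdispls[r] = total;
        total += recvcounts[r];
    }
    uint64_t *recvbuf = malloc((total > 0 ? total : 1) * sizeof *recvbuf);

    /* The variable-sized index lists are then exchanged in one collective. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_UINT64_T,
                  recvbuf, recvcounts, rdispls, MPI_UINT64_T, comm);

    /* ... look up the requested entries in recvbuf and answer the requests ... */

    free(recvbuf);
    free(rdispls);
    free(recvcounts);
}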
I'm pretty new to MPI programming and got stuck in the middle of my project.
I want to write an MPI code for the following problem, and I am not sure which MPI functions are appropriate.
Here is the problem:
Processor 0 has a 2D vector or array of edges, Edges = {(0,4), (1,5)}. It needs to get some information from the other processors, which are not always the same fixed processors; it depends on the set Edges. Therefore, I need a for loop as follows:
if(my_rank==0)
{
for(all pairs (i,j) in Edges)
{
send i (or j) to Processor r (r depends on the index i)
receive L_r from Processor r
create (L_i, L_j, min(L_i,L_j)) // want to broadcast to all later.
}
}
Now, I am not sure how to do it for processor r. Should I do it in a for loop?
Note that I cannot do it in an if statement, since I don't know which processor it would be, so I would need an if statement per processor, which I don't think is the right way. I might have very many processors, each holding some part of a matrix.
I need to point out that I cannot communicate with a subgroup of communicators, since it all depends on the indices. Basically, I want the labels: for example, the indices (0,4) require communicating with P4, which holds them.
Any ideas are appreciated.
I would do it as follows:
1) Proc 0 constructs a list of every process it has to communicate with.
2) Proc 0 broadcasts this list to all processes (or only to the ones it has to communicate with, but that is more complicated; it can be done once you have a version that works).
3) You perform your communication:
If(rank==0){...}
else if (rank in the list){...}
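A minimal sketch of steps 1-3, with the label L_r taken to be a single int held by rank r (the edge values, my_label and the tags are illustrative; run with at least 6 processes for these example edges):

#include <mpi.h>
#include <stdio.h>

#define NEDGES 2

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Step 1: only rank 0 knows the edge list initially. */
    int edges[NEDGES][2];
    if (rank == 0) {
        edges[0][0] = 0; edges[0][1] = 4;
        edges[1][0] = 1; edges[1][1] = 5;
    }

    /* Step 2: broadcast the list so every rank can tell whether it is involved. */
    MPI_Bcast(&edges[0][0], 2 * NEDGES, MPI_INT, 0, MPI_COMM_WORLD);

    int my_label = 100 + rank;   /* stand-in for the local data L_r */

    /* Step 3: rank 0 asks for the labels; a rank r appearing in an edge answers. */
    for (int e = 0; e < NEDGES; ++e) {
        for (int k = 0; k < 2; ++k) {
            int r = edges[e][k];
            if (rank == 0 && r != 0) {
                int L_r;
                MPI_Recv(&L_r, 1, MPI_INT, r, e, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("edge %d: label of rank %d is %d\n", e, r, L_r);
            } else if (rank == r && r != 0) {
                MPI_Send(&my_label, 1, MPI_INT, 0, e, MPI_COMM_WORLD);
            }
        }
    }

    MPI_Finalize();
    return 0;
}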
I want to send a value to a position in an array of another process.
so
1st process: MPI_Isend(&val, ..., process, ...)
2nd process: MPI_Recv(&array[i], ..., process, ...)
I know the index i on the first process. I also know that I can't use a variable, first sending i and then val, because other processes can change i (the 2nd process is accepting messages from many others).
First of all, other sends/receives should not/cannot overwrite i. You should keep your messages clear and separated; that's what the tag is for! Also, rank_2 can tell which rank sent the data, so you can keep one i for every rank you await a message from.
Finally, you might want to check out one-sided MPI communication (MPI_Win). With that technique, rank_1 can 'drop' the value directly into rank_2's array at a position known only to rank_1.
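A minimal sketch of the one-sided variant (the array size, the value 42 and the index 7 are illustrative; run with at least 2 processes): rank 1 puts val into rank 0's array at an index only rank 1 knows.

#include <mpi.h>
#include <stdio.h>

#define N 10

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int array[N] = {0};

    /* Expose the local array in a window so other ranks can write into it. */
    MPI_Win win;
    MPI_Win_create(array, N * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 1) {
        int val = 42, i = 7;   /* i is known only to rank 1 */
        MPI_Put(&val, 1, MPI_INT, 0 /* target rank */,
                i /* target displacement */, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);     /* completes the put on both sides */

    if (rank == 0)
        printf("array[7] = %d\n", array[7]);   /* just to check the demo */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}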