MPI receiving data from an unknown number of ranks - c++

I have a list of indices whose corresponding entries in a vector I do not know, because the vector is distributed among the ranks. I have to send these indices to the ranks that own them to get the data.
Conversely, "my" rank also receives lists of indices from an unknown number of ranks. After receiving a list, "my" rank has to send the corresponding data back to the requesting rank.
I think I have to work with a mixture of MPI_Probe and MPI_Gather, but at the moment I cannot see how to receive lists from an unknown number of ranks.
I think it has to look like this, but how can I receive data from a larger, unknown number of ranks? Or do I have to loop over all possible ranks that could send me something?
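MPI_Status status;
int count; // number of MPI_UINT64_T elements, not bytes
std::vector<Size> indices; // Size is assumed to be a 64-bit unsigned integer
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
MPI_Get_count(&status, MPI_UINT64_T, &count);
if (count != MPI_UNDEFINED) {
    // resize, not reserve: the elements must exist before MPI_Recv writes into them
    indices.resize(count);
    MPI_Recv(indices.data(), count, MPI_UINT64_T, status.MPI_SOURCE, status.MPI_TAG, comm, &status);
}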

This closely resembles what I did a few years ago for parallel I/O.
One option:
On each sender, determine the size that it needs to send to each other rank
Send the sizes (Allgather if all ranks can be senders, otherwise sends/receives)
Do an (all)gatherv that will retrieve the size on each receiver
You can use non-blocking sends/receives as well as a non-blocking gatherv (MPI-3), and this scales well (depending on the hardware) to 500 cores for 8 senders.
The way we did it was to go through the vector in chunks of several MB and send the data chunk by chunk. Of course, the bigger the chunks the better, but bigger chunks also need more memory on each sender rank to hold the data.
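A minimal sketch of the bookkeeping that sits between the size exchange and the (all)gatherv, assuming the per-rank element counts have already been exchanged. The helper names `displacements_from_counts` and `total_count` are hypothetical; the resulting vectors are the kind of values `MPI_Allgatherv` expects in its `recvcounts` and `displs` arguments.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: given how many elements each rank contributes,
// compute where each rank's block starts in the receive buffer.
// The result is an exclusive prefix sum of the counts.
std::vector<int> displacements_from_counts(const std::vector<int>& counts) {
    std::vector<int> displs(counts.size(), 0);
    if (!counts.empty()) {
        // displs[i] = counts[0] + ... + counts[i-1]
        std::partial_sum(counts.begin(), counts.end() - 1, displs.begin() + 1);
    }
    return displs;
}

// The receive buffer must hold the sum of all contributions.
int total_count(const std::vector<int>& counts) {
    return std::accumulate(counts.begin(), counts.end(), 0);
}
```

A rank that sends nothing simply has a count of zero, so the receiver never needs to know in advance which ranks will contribute.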

Related

How does one send custom MPI_Datatype over to a different process?

Suppose that I create custom MPI_Datatypes for subarrays of different sizes on each of the MPI processes allocated to a program. Now I wish to send these subarrays to the master process and assemble them into a bigger array block by block. The master process is unaware of the individual datatypes (defined by the local sizes) on the other processes. Naively, therefore, I might attempt to send over these custom datatypes to the master process in the following manner.
MPI_Datatype localarr_type;
MPI_Type_create_subarray( NDIMS, array_size, local_size, box_corner, MPI_ORDER_C, MPI_FLOAT, &localarr_type );
MPI_Type_commit(&localarr_type);
if (rank == master)
{
    for (int id = 1; id < nprocs; ++id)
    {
        MPI_Recv( &localarr_type, 1, MPI_Datatype, id, tag1[id], comm_cart, MPI_STATUS_IGNORE );
        MPI_Recv( big_array, 1, localarr_type, id, tag2[id], comm_cart, MPI_STATUS_IGNORE );
    }
}
else
{
    MPI_Send( &localarr_type, 1, MPI_Datatype, master, tag1[rank], comm_cart );
    MPI_Send( local_box, 1, localarr_type, master, tag2[rank], comm_cart );
}
However, this results in a compilation error. The first message below comes from the GNU and Clang compilers, the second from the Intel compiler.
/* GNU OR CLANG COMPILER */
error: unexpected type name 'MPI_Datatype': expected expression
/* INTEL COMPILER */
error: type name is not allowed
This means that either (1) I am attempting to send a custom MPI_Datatype over to a different process in the wrong way or that (2) this is not possible at all. I would like to know which it is, and if it is (1), I would like to know what the correct way of communicating a custom MPI_Datatype is. Thank you.
Note.
I am aware of other ways of solving the above problem without needing to communicate MPI_Datatypes. For example, one could communicate the local array sizes and manually reconstruct the MPI_Datatype from other processes inside the master process before using it in the subsequent communication of subarrays. This is not what I am looking for.
I wish to communicate the custom MPI_Datatype itself (as shown in the example above), not something that is an instance of the datatype (which is doable, as also shown in the example code above).
First of all: you cannot send a datatype like that. MPI_Datatype is the name of a type, not a value of type MPI_Datatype, so it cannot be passed as the datatype argument. (It's a cute idea though.) You could send the parameters with which the datatype is constructed, and then reconstruct it on the receiving side.
However, you are probably misunderstanding the nature of MPI. In your code, with the same datatype on workers and manager, you are sort of assuming that everyone has data of the same size/shape. That is not compatible with the manager gathering everything together.
If you're gathering data on a manager process (usually not a good idea: are you really sure you need that?) then the contributing processes have the data in a small array, say at index 0..99. So you can send them as an ordinary contiguous buffer. The "manager" has a much larger array, and places all the contributions in disjoint locations. So at most the manager needs to create subarray types to indicate where the received data goes in the big array.
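A minimal sketch of the "send the parameters" route mentioned above, assuming the datatype is fully described by the three integer arrays passed to `MPI_Type_create_subarray`. The struct `SubarrayParams` and the functions `pack` and `unpack` are hypothetical names; the flat vector could be sent as plain `MPI_INT` values, and the receiver would rebuild the datatype from the unpacked arrays.

```cpp
#include <vector>

// Hypothetical description of one process's box: the same three arrays
// that MPI_Type_create_subarray takes (sizes, subsizes, starts).
struct SubarrayParams {
    std::vector<int> array_size;  // extent of the full array per dimension
    std::vector<int> local_size;  // extent of this process's box
    std::vector<int> box_corner;  // starting index of the box
};

// Flatten to [ndims, sizes..., subsizes..., starts...] so it can travel as ints.
std::vector<int> pack(const SubarrayParams& p) {
    std::vector<int> flat;
    flat.push_back(static_cast<int>(p.array_size.size()));
    flat.insert(flat.end(), p.array_size.begin(), p.array_size.end());
    flat.insert(flat.end(), p.local_size.begin(), p.local_size.end());
    flat.insert(flat.end(), p.box_corner.begin(), p.box_corner.end());
    return flat;
}

// Inverse of pack(); assumes `flat` was produced by pack() and is well formed.
SubarrayParams unpack(const std::vector<int>& flat) {
    const int n = flat.at(0);
    const auto first = flat.begin() + 1;
    SubarrayParams p;
    p.array_size.assign(first, first + n);
    p.local_size.assign(first + n, first + 2 * n);
    p.box_corner.assign(first + 2 * n, first + 3 * n);
    return p;
}
```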

MPI - how to send a value to a specific position in array

I want to send a value to a position in an array of another process.
So:
1st process: MPI_Isend (&val..., process, ..)
2nd process: MPI_Recv (&array[i], ..., process, ...)
I know the index i on the first process. I also know that I can't use a variable for it (first send i, then val), as other processes can change i (the 2nd process is accepting messages from many others).
First of all, other sends/receives should not and cannot overwrite i. Keep your messages clear and separated: that's what the tag is for! Also, rank_2 can tell which rank sent the data, so you can have one i for every rank you await a message from.
Finally you might want to check out one-sided MPI communication (MPI_Win). With that technique rank_1 can 'drop' the message directly into rank_2's array at the position only known to rank_1.
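A minimal sketch of the "one i for every rank" idea, written without MPI calls so that only the bookkeeping is shown. The function `store_from` and the table `index_for_rank` are hypothetical; in an MPI program the `source` argument would come from the `MPI_SOURCE` field of the receive status.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical receiver-side table: index_for_rank[r] is the array position
// reserved for the value sent by rank r. Each sender has its own entry,
// so messages from different ranks cannot overwrite each other's index.
void store_from(std::vector<double>& array,
                const std::vector<std::size_t>& index_for_rank,
                int source, double value) {
    if (source < 0 || static_cast<std::size_t>(source) >= index_for_rank.size()) {
        throw std::out_of_range("unknown source rank");
    }
    // at() also rejects an index outside the target array.
    array.at(index_for_rank[static_cast<std::size_t>(source)]) = value;
}
```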

determining the size of the buffer at run time? (socket programming)

In order to determine the type of message received in a UDP packet, I need to look at a specific buffer element [i] received from "recvfrom". First, I use a buffer on the stack for recvfrom to fill; I know the maximum size of the message I should receive.
Say my buffer is 300 bytes and I receive packets of different sizes (e.g. 30, 80, 210 bytes, etc.). How can I know the size received? (There are a few other criteria I test for to determine the nature of the message.)
Knowing the size will enable me to memcpy the data into an object.
I'm thinking of strlen(udp packet), because it is determined at run time as opposed to compile time.
The problem is: what if the rest of the packet was filled with junk?
I appreciate it
recvfrom(2), like recv(2), returns the number of bytes received in the UDP packet (or -1 on error), so there is no need for strlen.
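A minimal sketch of using that return value, assuming a POSIX system and an already opened datagram socket. The function name `receive_datagram` is hypothetical, and the 300-byte buffer is the maximum size mentioned in the question. `recvfrom` reports the size the same way when the sender's address is also needed.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <stdexcept>
#include <vector>

// Receives one datagram and returns exactly the bytes that arrived.
std::vector<unsigned char> receive_datagram(int fd) {
    unsigned char buf[300];                          // maximum expected message size
    const ssize_t n = recv(fd, buf, sizeof buf, 0);  // n = bytes in this datagram
    if (n < 0) {
        throw std::runtime_error("recv failed");
    }
    // Copy only the n valid bytes; the rest of buf is untouched junk.
    return std::vector<unsigned char>(buf, buf + n);
}
```

Unlike strlen, this works for binary payloads that contain zero bytes or have no terminator.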

Socket Communication with High frequency

I need to send data to another process every 0.02s.
The Server code:
//set socket, bind, listen
while(1){
    sleep(0.02);
    echo(newsockfd);
}
void echo (int sock)
{
    int n;
    char buffer[256]="abc";
    n=send(sock,buffer,strlen(buffer),0);
    if (n < 0) error("ERROR Sending");
}
The Client code:
//connect
while(1)
{
    bzero(buffer,256);
    n = read(sock,buffer,255);
    printf("Recieved data:%s\n",buffer);
    if (n < 0)
        error("ERROR reading from socket");
}
The problem is that:
The client shows something like this:
Recieved data:abc
Recieved data:abcabcabc
Recieved data:abcabc
....
How does it happen? When I set sleep time:
...
sleep(2)
...
It would be ok:
Recieved data:abc
Recieved data:abc
Recieved data:abc
...
TCP sockets do not guarantee framing. When you send bytes over a TCP socket, those bytes will be received on the other end in the same order, but they will not necessarily be grouped the same way — they may be split up, or grouped together, or regrouped, in any way the operating system sees fit.
If you need framing, you will need to send some sort of packet header to indicate where each chunk of data starts and ends. This may take the form of either a delimiter (e.g, a \n or \0 to indicate where each chunk ends), or a length value (e.g, a number at the head of each chunk to denote how long it is).
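A minimal sketch of the delimiter variant, assuming messages are text terminated by '\n' and never contain that character themselves. The function name `extract_lines` and the `pending` buffer are hypothetical; the caller would pass in whatever bytes each read from the socket returned.

```cpp
#include <string>
#include <vector>

// Appends newly received bytes to `pending` and returns every complete
// '\n'-terminated message. An incomplete tail stays in `pending` until
// the rest of it arrives.
std::vector<std::string> extract_lines(std::string& pending, const std::string& received) {
    pending += received;
    std::vector<std::string> messages;
    std::string::size_type start = 0;
    std::string::size_type end;
    while ((end = pending.find('\n', start)) != std::string::npos) {
        messages.push_back(pending.substr(start, end - start));
        start = end + 1;  // skip the delimiter itself
    }
    pending.erase(0, start);  // drop everything that was handed out
    return messages;
}
```

With this, a read that delivers "abc\nabc\nab" yields two messages and keeps "ab" until the next read completes it.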
Also, as other respondents have noted, sleep() takes an integer, so you're effectively not sleeping at all here.
sleep takes an unsigned int as its argument, so sleep(0.02) is actually sleep(0).
unsigned int sleep(unsigned int seconds);
Use usleep(20000) instead. Its argument is in microseconds, and 0.02 s is 20,000 microseconds:
int usleep(useconds_t usec);
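A minimal sketch of the 0.02 s pause from the question, using usleep, whose argument is in microseconds (0.02 s = 20,000 microseconds). The function name `pause_between_sends` is hypothetical. usleep is available on Linux but was declared obsolete by POSIX in favour of nanosleep, so treat this as an illustration and not as the only option.

```cpp
#include <unistd.h>

// Pauses for 20 ms. Returns 0 on success and -1 on error
// (for example, if the call was interrupted by a signal).
int pause_between_sends() {
    const useconds_t twenty_ms = 20000;  // 0.02 s expressed in microseconds
    return usleep(twenty_ms);
}
```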
The OS is at liberty to buffer data (i.e. why not just send one full packet instead of multiple packets?).
Besides, sleep takes an unsigned integer.
The reason is that the OS is buffering data to be sent. It will buffer based on either size or time. In this case, you're not sending enough data, but you're sending it fast enough the OS is choosing to bulk it up before putting it on the wire.
When you add the sleep(2), that is long enough that the OS chooses to send a single "abc" before the next one comes in.
You need to understand that TCP is simply a byte stream. It has no concept of messages or sizes. You simply put bytes on the wire on one end and take them off on the other. If you want message boundaries, you need to interpret the data in a specific way when you read it. Because of this, the correct solution is to create an actual protocol. That protocol could be as simple as "each 3 bytes is one message", or more complicated, where you send a size prefix.
UDP may also be a good solution for you, depending on your other requirements.
sleep(0.02) is effectively sleep(0), because the argument is an unsigned int and the implicit conversion truncates the value. So you have no sleep at all here. Note that sleep(2) sleeps for 2 seconds; for 0.02 s you can use usleep(20000), whose argument is in microseconds. Next, even if you did sleep, there is no guarantee that your messages will be sent in different frames. If you need this, you should apply some sort of delimiter; I have seen the '\0' character used in some implementations.
TCP/IP stacks buffer up data until there's a decent amount of it, or until they decide that there's no more coming from the application and send what they've got anyway.
There are two things you will need to do. First, turn off Nagle's algorithm. Second, sort out some sort of framing mechanism.
Turning off Nagle's algorithm will cause the stack to "send data immediately", rather than waiting on the off chance that you'll be wanting to send more. It actually leads to less network efficiency because you're not filling up Ethernet frames, something to bear in mind on Gigabit where jumbo frames are required to get best throughput. But in your case timeliness is more important than throughput.
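A minimal sketch of turning off Nagle's algorithm on a POSIX system, assuming `sockfd` is a TCP socket. The function name `disable_nagle` is hypothetical; the socket option itself is `TCP_NODELAY` at level `IPPROTO_TCP`.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Sets TCP_NODELAY so small writes are sent without waiting to be coalesced.
// Returns true if the option was set.
bool disable_nagle(int sockfd) {
    int flag = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof flag) == 0;
}
```

This only affects when the sender transmits; the receiver can still get several messages in one read, so framing is needed either way.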
You can do your own framing by very simple means, e.g. by sending an integer first that says how long the rest of the message will be. At the reader end you would read the integer, and then read that number of bytes. For the next message you'd send another integer saying how long that message is, etc.
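A minimal sketch of the sending side of such framing, assuming the length is encoded as a 4-byte big-endian integer (the width and byte order are assumptions, not something the text prescribes). The function names `make_frame` and `frame_length` are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Prefixes `payload` with its length as a 4-byte big-endian integer.
std::string make_frame(const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string frame;
    frame.push_back(static_cast<char>((n >> 24) & 0xFF));
    frame.push_back(static_cast<char>((n >> 16) & 0xFF));
    frame.push_back(static_cast<char>((n >> 8) & 0xFF));
    frame.push_back(static_cast<char>(n & 0xFF));
    frame += payload;
    return frame;
}

// Reads the length back from the first four bytes of a frame.
// Assumes the caller has already received at least four bytes.
std::uint32_t frame_length(const std::string& frame) {
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i) {
        n = (n << 8) | static_cast<unsigned char>(frame[static_cast<std::string::size_type>(i)]);
    }
    return n;
}
```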
That sort of thing is ok but not hugely robust. You could look at something like ASN.1 or Google Protocol Buffers.
I've used Objective System's ASN.1 libraries and tools (they're not free) and they do a good job of looking after message integrity, framing, etc. They're good because they don't read data from a network connection one byte at a time so the efficiency and speed isn't too bad. Any extra data read is retained and included in the next message decode.
I've not used Google Protocol Buffers myself but it's possible that they have similar characteristics, and there may be other similar serialisation mechanisms out there. I'd recommend avoiding XML serialisation for speed/efficiency reasons.

Handling TCP Streams

Our server is seemingly packet based. It is an adaptation from an old serial based system. It has been added, modified, re-built, etc over the years. Since TCP is a stream protocol and not a packet protocol, sometimes the packets get broken up. The ServerSocket is designed in such a way that when the Client sends data, part of the data contains the size of our message such as 55. Sometimes these packets are split into multiple pieces. They arrive in order but since we do not know how the messages will be split, our server sometimes does not know how to identify the split message.
So, with that background: what is the best method to rebuild the packets as they come in if they are split? We are using C++ Builder 5 (yes, I know, an old IDE, but this is all we can work with at the moment; it would be a lot of work to redesign in .NET or a newer technology).
TCP guarantees that the data will arrive in the same order it was sent.
That being said, you can just append all the incoming data to a buffer. Then check whether your buffer contains one or more complete packets, and remove them from the buffer, keeping all the remaining data in the buffer for a future check.
This, of course, supposes that your packets have some header that indicates the size of the following data.
Let's consider packets with the following structure:
[LEN] X X X...
Where LEN is the size of the data and each X is a byte.
If you receive:
4 X X X
[--1--]
The packet is not complete, so you can leave it in the buffer. When more data arrives, you just append it to the buffer:
4 X X X X 3 X X X
[---2---]
You then have 2 complete messages that you can easily parse.
If you do this, don't forget to send any length in a host-independent form, i.e. network byte order (htons and htonl when sending, ntohs and ntohl when receiving).
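The buffering scheme described above can be sketched as follows, assuming the [LEN] header is a single byte as in the diagram (so byte order does not come into play). The class name `PacketBuffer` and its methods are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class PacketBuffer {
public:
    // Append whatever the socket delivered, regardless of how it was split.
    void append(const std::uint8_t* data, std::size_t n) {
        buffer_.insert(buffer_.end(), data, data + n);
    }

    // If a complete packet is buffered, copies its payload into `out`,
    // removes header and payload from the buffer, and returns true.
    // Otherwise leaves the buffer untouched and returns false.
    bool next(std::vector<std::uint8_t>& out) {
        if (buffer_.empty()) return false;
        const std::size_t len = buffer_[0];              // the [LEN] byte
        if (buffer_.size() < 1 + len) return false;      // payload still incomplete
        out.assign(buffer_.begin() + 1, buffer_.begin() + 1 + len);
        buffer_.erase(buffer_.begin(), buffer_.begin() + 1 + len);
        return true;
    }

private:
    std::vector<std::uint8_t> buffer_;
};
```

Calling next() in a loop after every append() drains all packets that have become complete and leaves any partial one for later.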
This is often accomplished by prefixing messages with a one or two-byte length value which, like you said, gives the length of the remaining data. If I've understood you correctly, you're sending this as plain text (i.e., '5', '5') and this might get split up. Since you don't know the length of a decimal number, it's somewhat ambiguous. If you absolutely need to go with plain text, perhaps you could encode the length as a 16-bit hex value, i.e.:
00ff <255 bytes data>
000a <10 bytes data>
This way, the length of the size header is fixed to 4 bytes and can be used as a minimum read length when receiving on the socket.
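A minimal sketch of decoding such a fixed 4-character hex header, assuming the header has already been read in full. The function name `parse_hex_length` and the use of -1 as an error value are hypothetical choices.

```cpp
#include <string>

// Parses a fixed 4-character hex length header such as "00ff".
// Returns -1 if the header is too short or contains a non-hex character.
int parse_hex_length(const std::string& header) {
    if (header.size() < 4) return -1;
    int value = 0;
    for (std::string::size_type i = 0; i < 4; ++i) {
        const char c = header[i];
        int digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return -1;
        value = value * 16 + digit;  // shift previous digits up by one hex place
    }
    return value;
}
```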
Edit: Perhaps I misunderstood -- if reading the length value isn't a problem, deal with splits by concatenating incoming data to a string, byte buffer, or whatever until its length is equal to the value you read in the beginning. TCP will take care of the rest.
Take extra precautions to make sure that you can't get stuck in a blocking read state should the client not send a complete message. For example, say you receive the length header, and start a loop that keeps reading through blocking recv() calls until the buffer is filled. If a malicious client intentionally stops sending data, your server might be locked until the client either disconnects, or starts sending.
I would have a function called readBytes or something that takes a buffer and a length parameter and reads until that many bytes have been read. You'll need to capture the number of bytes actually read and if it's less than the number you're expecting, advance your buffer pointer and read the rest. Keep looping until you've read them all.
Then call this function once for the header (containing the length), assuming that the header is a fixed length. Once you have the length of the actual data, call this function again.
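A minimal sketch of such a readBytes function on a POSIX system, assuming a blocking descriptor. The name `read_bytes` and the boolean return convention are hypothetical. It does not address the stalled-client problem, which would need a timeout or non-blocking I/O on top.

```cpp
#include <unistd.h>
#include <cerrno>
#include <cstddef>

// Reads exactly `len` bytes from `fd` into `buf`, looping over short reads.
// Returns true on success, false on end-of-stream or error.
bool read_bytes(int fd, void* buf, std::size_t len) {
    char* p = static_cast<char*>(buf);
    while (len > 0) {
        const ssize_t n = read(fd, p, len);
        if (n == 0) return false;          // peer closed before the full message arrived
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            return false;
        }
        p += n;                            // advance past the bytes just read
        len -= static_cast<std::size_t>(n);
    }
    return true;
}
```

The caller would invoke it once with the fixed header size, decode the length, and then invoke it again with that length.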