What exactly happens when we use mpi_send/receive functions? - c++

when we use mpi_send/receive functions what happens? I mean this communication is done by value or by address of the variable that we desire to be sent and received (for example process 0 wants send variable "a" to process 1. Process 0 what exactly sends value of variable "a" or address of "a" ). And what happens when we use derived data types for communication?

Quite a bit of magic happens behind the scenes.
First, there's the unexpected message queue. When the sender calls MPI_Send before the receiver has called MPI_Recv, MPI doesn't know where in the receiver's memory the message is going. Two things can happen at this point. If the message is short, it is copied to a temporary buffer at the receiver. When the receiver calls MPI_Recv it first checks if a matching message has already arrived, and if it has, copies the data to the final destination. If not, the information about the target buffer is stored in the queue so the MPI_Recv can be matched when the message arrives. It is possible to examine the unexpected queue with MPI_Probe.
If the message is longer than some threshold, copying it would take too long. Instead, the sender and the receiver do a handshake with a rendezvous protocol of some sort to make sure the target is ready to receive the message before it is sent out. This is especially important with a high-speed network like InfiniBand.
If the communicating ranks are on the same machine, usually the data transfer happens through shared memory. Because MPI ranks are independent processes, they do not have access to each other's memory. Instead, the MPI processes on the same node set up a shared memory region and use it to transfer messages. So sending a message involves copying the data twice: the sender copies it into the shared buffer, and the receiver copies it out into its own address space. There exists an exception to this. If the MPI library is configured to use a kernel module like KNEM, the message can be copied directly to the destination in the OS kernel. However, such a copy incurs a penalty of a system call. Copying through the kernel is usually only worth it for large messages. Specialized HPC operating systems like Catamount can change these rules.
Collective operations can be implemented either in terms of send/receive, or can have a completely separate optimized implementation. It is common to have implementations of several algorithms for a collective operation. The MPI library decides at runtime which one to use for best performance depending on the size of the messages and the communicator.
A good MPI implementation will try very hard to transfer a derived datatype without creating extra copies. It will figure out which regions of memory within a datatype are contiguous and copy them individually. However, in some cases MPI will fall back to using MPI_Pack behind the scenes to make the message contiguous, and then transfer and unpack it.

As far as the applications system programmer need be concerned these operations send and receive data, not addresses of data. MPI processes do not share an address space, so an address on process 0 is meaningless to an operation on process 1 - if process 1 wants the data at an address on process 0 it has to get it from process 0. I don't think that the single-sided communications which came in with MPI-2 materially affect this situation.
What goes on under the hood, the view from the developer of the MPI libraries, might be different and will certainly be implementation dependent. For example, if you are using a well written MPI library on a shared-memory machine then yes, it might just implement message passing by sending pointers to address locations around the system. But this is a corner case, and not much seen these days.

mpi_send requires you to give the address to the memory holding the data to be sent. It will return only when it is safe for you to re-use that memory (non-blocking communications can avoid this).
Similarly, mpi_recv requires you to give the address of sufficient memory where it can copy the data to be received into. It will return only when the data have been received into that buffer.
How MPI does that, is another matter and you don't need to worry about that for writing a working MPI program (but possibly for writing an efficient one).

Related

MPI Send and Receive modes for large number of processors

I know there are a ton of questions & answers about the different modes of MPI send and receive out there, but I believe mine is different or I am simply not able to apply these answers to my problem.
Anyway, my scenario is as follows. The code is intended for high performance clusters with potentially thousands of cores, organized into a multi-dimensional grid. In my algorithm, there are two successive operations that need to be carry out, let's call them A and B, where A precedes B. A brief description is as follows:
A: Each processor has multiple buffers. It has to send each of these buffers to a certain set of processors. For each buffer to be sent, the set of receiving processors might differ. Sending is the last step of operation A.
B: Each processor receives a set of buffers from a set of processors. Operation B will then work on these buffers once it has received all of them. The result of that operation will be stored in a fixed location (neither in the send or receive buffers)
The following properties are also given:
in A, every processor can compute which processors to send to, and it can also compute a corresponding tag in case that a processor receives multiple buffers from the same sending processor (which is very likely).
in B, every processor can also compute which processors it will receive from, and the corresponding tags that the messages were sent with.
Each processor has its own send buffers and receive buffers, and these are disjoint (i.e. there is no processor that uses its send buffer as a receive buffer as well, and vice versa).
A and B are executed in a loop among other operations before A and after B. We can ensure that the send buffer will not be used again until the next loop iteration, where it is filled with new data in A, and the receive buffers will also not be used again until the next iteration where they are used to receive new data in operation B.
The transition between A and B should, if possible, be a synchronization point, i.e. we want to ensure that all processors enter B at the same time
To send and receive, both A and B have to use (nested) loops on their own to send and receive the different buffers. However, we cannot make any assumption about the order of these send and receive statements, i.e. for any two buffers buf0 and buf1 we cannot guarantee that if buf0 is received by some processor before buf1, that buf0 was also sent before buf1. Note that at this point, using group operations like MPI_Broadcast etc. is not an option yet due to the complexity of determining the set of receiving/sending processors.
Question: Which send and receive modes should I use? I have read a lot of different stuff about these different modes, but I cannot really wrap my head around them. The most important property is deadlock-freedom, and the next important thing is performance. I am tending towards using MPI_Isend() in A without checking the request status and again using the non-blocking MPI_IRecv() in B's loops, and then using MPI_Waitall() to ensure that all buffers were received (and as a consequence, also all buffers have been sent and the processors are synchronized).
Is this the correct approach, or do I have to use buffered sends or something entirely different? I don't have a ton of experience in MPI and the documentation does not really help me much either.
From how you describe your problem, I think MPI_Isend is likely to be the best (only?) option for A because it's guaranteed non-blocking, whereas MPI_Send may be non-blocking, but only if it is able to internally buffer your send buffer.
You should then be able to use an MPI_Barrier so that all processes enter B at the same time. But this might be a performance hit. If you don't insist that all processes enter B at the same time, some can begin receiving messages sooner. Furthermore, given your disjoint send & receive buffers, this should be safe.
For B, you can use MPI_Irecv or MPI_Recv. MPI_Irecv is likely to be faster, because a standard MPI_Recv might be waiting around for a slower send from another process.
Whether or not you block on the receiving end, you ought to call MPI_Waitall before finishing the loop to ensure all send/recv operations have completed successfully.
An additional point: you could leverage MPI_ANY_SOURCE with MPI_Recv to receive messages in a blocking manner and immediately operate on them, no matter what order they arrive. However, given that you specified that no operations on the data happen until all data are received, this might not be that useful.
Finally: as mentioned in these recommendations, you will get best performance if you can restructure your code so that you can just use MPI_SSend. In this case you avoid any buffering at all. To achieve this, you'd have to have all processes first call an MPI_Irecv, then begin sending via MPI_Ssend. It might not be as hard as you think to refactor in this way, particularly if, as you say, each process can work out independently which messages it will receive from whom.

Should I use multiple threads for a multi socket client?

I understand that for most cases using threads in Qt networking is overkill and unnecessary, especially if you do it the proper way and use the readyRead() signal. However, my "client" application will have multiple sockets open (about 5) at one time. It is possible for there to be data coming in on all sockets at the same time. I am really not going to be doing any intense processing with the incoming data. Simply reading it in and then sending out a signal to update the GUI with the newly received data. Do you think a single thread application should be able to handle all of the data coming in?
I understand that I haven't shown you any code and that my description is pretty vague and it could very well depend on how it performs once implemented, but from a general design perspective and your guys' expertise, what is your opinion?
Unless you are receiving really high-bandwidth streams (e.g. megabytes per second rather than kilobytes per second), a single-threaded design should be sufficient. Keep in mind that the OS's networking stack is running "in the background" at all times, receiving TCP packets and storing the received data inside fixed-size in-kernel memory buffers. This happens in parallel with your program's execution, so in most cases the fact that your program is single-threaded and busy dealing with a GUI update (or another socket) won't hamper your computer's reception of TCP packets.
The case where a single-threaded design would cause a slowdown of TCP traffic is if your program (via Qt) didn't call recv() quickly enough, such that the kernel's TCP-receive buffer for a socket became entirely filled with data. At that point the kernel would have no choice but to start dropping incoming TCP packets for that socket, which would cause the server to have to re-send those TCP packets, and that would cause the socket's TCP receive rate to slow down, at least temporarily. However, that problem can be avoided by making sure the buffers never (or at least rarely) get full.
The obvious way to do that is to ensure that your program reads all of the incoming data as quickly as possible -- something that QTCPSocket does by default. The only thing you need to do is make sure that your GUI updates don't take an inordinate amount of time -- and Qt's widget-update routines are fairly efficient, so they shouldn't, unless you have a really elaborate GUI or an inefficient custom paintEvent() routine or etc.
If that's not sufficient, the next thing you could do (if necessary) is tell the OS's TCP stack to increase the size of its in-kernel TCP receive buffer, e.g. by doing:
int fd = myQTCPSocketObject.descriptor();
int newBufSizeBytes = 128*1024; // request 128kB kernel recv-buffer for this socket
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &newBufSizeBytes, sizeof(newBufSizeBytes)) != 0) perror("setsockopt");
Doing that would give your (single) thread more time to react before incoming packets start getting dropped for lack of in-kernel buffer space.
If, after trying all that, you still aren't getting the network performance you need, then you can try going multithreaded. I doubt it will come to that, but if it does, it needn't affect your program's design too much; you'd just write a wrapper class (called SocketThread or something) that holds your QTCPSocket object and runs an internal thread that handles the reading from the socket, and emits a bytesReceived(QByteArray) signal whenever the thread reads data from the socket. The rest of your code would remain approximately the same; just modify it to hold the SocketThread object instead of a QTCPSocket, and connect the SocketThread's bytesReceived(QByteArray) signal to a corresponding slot (via a QueuedConnection, of course, for thread-safety) and use that instead of responding directly to readReady().
Implement it without threads, using a thread-considerate design(*), measure the delay your data experiences, decide if it is within acceptable bounds. Then decide if you need to use threads to capture it more rapidly.
From your description, the key bottleneck is going to be GUI reception of the "data ready" signal, render it. If you use the approach of sending lots of these signals, your GUI is goign to be doing more re-renders.
If you use a single-thread approach, you can marshal the network reads and get all the updates and then refresh the GUI directly. As you've described it, this sounds like it will have the least degree of contention.
(* try to avoid constructs which will require an entire rewrite if you go threaded, but don't put so much effort into making it thread-proof that it will actually need threads to make it efficient, e.g. don't wrap everything with mutex calls)
I do not know much about Qt, but this could be a typical scenario where you use select() to multiplex multiple socket accesses with a single thread.
If the thread for selecting is used mainly for handling the data from/to the sockets you will be very fast(as you will have less context switches). So if you are not transfer really huge amounts of data it is likely possible that you will be faster will a single threaded solution.
That being said, i would go with the solution that fits the most for your needs, something that you can implement in a fair amount of time. Implementing select (async) can be quite a hassle, an overkill that might not be needed.
It's a C-like approach, but i hope i could help anyway.

Boost: is there an interprocess::message_queue-like mechanism for thread-only communication?

The boost::interprocess::message_queue mechanism seems primarily designed for just that: interprocess communication.
The problem is that it serializes the objects in the message:
"A message queue just copies raw bytes between processes and does not send objects."
This makes it completely unsuitable for fast and repeated interthread communication with large composite objects being passed.
I want to create a message with a ref/shared_ptr/pointer to a known and previously-created object and safely pass it from one thread to the next.
You CAN use asio::io_service and post with bind completions, but that's rather klunky AND requires that the thread in question be using asio, which seems a bit odd.
I've already written my own, sadly based on asio::io_service, but would prefer to switch over to a boost-supported general mechansim.
You need a mechanism, that designed for interprocess communication because separate processes has separate address space and you cannot simply pass pointers except very spacial cases. For thread communication you can use standard containers like std::stack, std::queue and std::priority_queue to communicate between threads, you just need to provide proper synchronization through mutexes. Or you can use lock-free containers, which also provided by boost. What else would you need for interthread communication?
Whilst I'm no expert in Boost per se, there is a fundamental difficulty in communicating between processes and threads via a pipe, message queue, etc, especially if it is assumed that a program's data is classes containing dynamically allocated memory (which is pretty much the case for things written with Boost; a string is not a simple object like it is in C...).
Copying of Data in Classes
Message queues and pipes are indeed just a way of passing a collection of bytes from one thread/process to another thread/process. Generally when you use them you're looking for the destination thread to end up with a copy of the original data, not just a copy of the references to the data (which would be pointing back at the original data).
With a simple C struct containing no pointers at all it's easy; a copy of the struct contains all the data, no problem. But a C++ class with complex data types like strings is now a structure containing references / pointers to allocated memory. Copy that structure and you haven't actually copied the data in the allocated memory.
That's where serialisation comes in. For interprocess communications where both processes can't ordinarily share the same memory serialisation serves as a way of parcelling up the structure to be sent plus all the data it refers to into a stream of bytes that can be unpacked at the other end. For threads it's no different if you don't want the two threads accessing the same memory at the same time. Serialisation is a convenient way of saving yourself having to navigating through a class to see exactly what needs to be copied.
Efficiency
I don't know what Boost uses for serialisation, but clearly serialising to XML would be painfully inefficient. A binary serialisation like ASN.1 BER would be much faster.
Also, copying data through pipes, message queues is no longer as inefficient as it used to be. Traditionally programmers don't do it because of the perceived waste of time spent copying the data repeatedly just to share it with another thread. With a single core machine that involves a lot of slow and wasteful memory accesses.
However, if one considers what "memory access" is in these days of QPI, Hypertransport, and so forth, it's not so very different to just copying the data in the first place. In both cases it involves data being sent over a serial bus from one core's memory controller to another core's cache.
Today's CPUs are really NUMA machines with memory access protocols layered on top of serial networks to fake an SMP environment. Programming in the style of copying messages through pipes, message queues, etc. is definitely edging towards saying that one is content with the idea of NUMA, and that really you don't need SMP at all.
Also, if you do all your inter-thread communications as message queues, they're not so very different to pipes, and pipes aren't so different to network sockets (at least that's the case on Not-Windows). So if you write your code carefully you can end up with a program that can be redeployed across a distributed network of computers or across a number of threads within a single process. That's a nice way of getting scalability because you're not changing the shape or feel of your program in any significant way when you scale up.
Fringe Benefits
Depending on the serialisation technology used there can be some fringe benefits. With ASN.1 you specify a message schema in which you set out the valid ranges of the message's contents. You can say, for example, that a message contains an integer, and it can have values between 0 and 10. The encoders and decoders generated by decent ASN.1 tools will automatically check that the data you're sending or receiving meets that constraint, and returns errors if not.
I would be surprised if other serialisers like Google Protocol Buffers didn't do a similar constraints check for you.
The benefit is that if you have a bug in your program and you try and send an out of spec message, the serialiser will automatically spot that for you. That can save a ton of time in debugging. Also it is something you definitely don't get if you share a memory buffer and protect it with a semaphore instead of using a message queue.
CSP
Communicating Sequential Processes and the Actor model are based on sending copies of data through message queues, pipes, etc. just like you're doing. CSP in particular is worth paying attention to because it's a good way of avoiding a lot of the pitfalls of multi-threaded software that can lurk undetected in source code.
There are some CSP implementations you can just use. There's JCSP, a class library for Java, and C++CSP, built on top of Boost to do CSP for C++. They're both from the University of Kent.
C++CSP looks quite interesting. It has a template class called csp::mobile, which is kind of like a Boost smart pointer. If you send one of these from one thread to another via a channel (CSP's word for a message queue) you're sending the reference, not the data. However, the template records which thread 'owns' the data. So a thread receiving a mobile now owns the data (which hasn't actually moved), and the thread that sent it can no longer access it. So you get the benefits of CSP without the overhead of copying the data.
It also looks like C++CSP is able to do channels over TCP; that's a very attractive feature, up scaling is a really simple possibility. JCSP works over network connections too.

what's the advantage of message queue over shared data in thread communication?

I read a article about multithread program design http://drdobbs.com/architecture-and-design/215900465, it says it's a best practice that "replacing shared data with asynchronous messages. As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuse me is that I don't see the difference between using shared data and message queues. I am now working on a non-gui project on windows, so let's use windows's message queues. and take a tradition producer-consumer problem as a example.
Using shared data, there would be a shared container and a lock guarding the container between the producer thread and the consumer thread. when producer output product, it first wait for the lock and then write something to the container then release the lock.
Using message queue, the producer could simply PostThreadMessage without block. and this is the async message's advantage. but I think there must exist some lock guarding the message queue between the two threads, otherwise the data will definitely corrupt. the PostThreadMessage call just hide the details. I don't know whether my guess is right but if it's true, the advantage seems no longer exist,since both two method do the same thing and the only difference is that the system hide the details when using message queues.
ps. maybe the message queue use a non-blocking containner, but I could use a concurrent container in the former way too. I want to know how the message queue is implemented and is there any performance difference bwtween the two ways?
updated:
I still don't get the concept of async message if the message queue operations are still blocked somewhere else. Correct me if my guess was wrong: when we use shared containers and locks we will block in our own thread. but when using message queues, myself's thread returned immediately, and left the blocking work to some system thread.
Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It's much easier to implement than is shared memory for intercomputer communication. Also, as you've already noticed, message passing has the advantage that application developers don't need to worry about the details of protections like shared memory.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speeds when within a computer. Shared memory is usually faster than message passing, as message-passing are typically implemented using system calls and thus require the more time-consuming tasks of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish shared-memory regions. Once established, all access are treated as normal memory accesses w/o extra assistance from the kernel.
Edit: One case that you might want implement your own queue is that there are lots of messages to be produced and consumed, e.g., a logging system. With the implemenetation of PostThreadMessage, its queue capacity is fixed. Messages will most liky get lost if that capacity is exceeded.
Imagine you have 1 thread producing data,and 4 threads processing that data (presumably to make use of a multi core machine). If you have a big global pool of data you are likely to have to lock it when any of the threads needs access, potentially blocking 3 other threads. As you add more processing threads you increase the chance of a lock having to wait and increase how many things might have to wait. Eventually adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread then they can't block each other. You stil have to lock the queue between the producer and consumer threads but as you have a separate queue for each thread you have a separate lock and each thread can't block all the others waiting for data.
If you suddenly get a 32 core machine you can add 20 more processing threads (and queues) and expect that performance will scale fairly linearly unlike the first case where the new threads will just run into each other all the time.
I have used a shared memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. This is very when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. Producers acquire a lock for input to the queue and receive a pointer to the next available storage chunk (usually an object of some sort), populates it and releases it. The consumer will block until the next shared object has released by the producer. It can then acquire a lock to the storage, process the data and release it back to the pool. In A suitably designed queue can perform multiple producer/multiple consumer operations with great efficiency. Think a Java thread safe (java.util.concurrent.BlockingQueue) semantics but for pointers to storage.
Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is when you pass a message, the consumer will receive a copy.
the PostThreadMessage call just hide the details
Yes, it does, but being a WINAPI call, you can be reasonably sure that it does it right.
I still don't get the concept of async message if the message queue operations are still blocked somewhere else.
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would be probably slow. In that case, you need to use need shared memory.
For passing state (i.e. worker thread reporting progress to the GUI) the messages are the way to go.
It's quite simple (I'm amazed others wrote such length responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.
I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (needs to be done right ofc). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard to find bugs, and even if the code is perfect it will eat performance because of all the locking.
I had exact the same question. After reading the answers. I feel:
in most typical use case, queue = async, shared memory (locks) = sync. Indeed, you can do a async version of shared memory, but that's more code, similar to reinvent the message passing wheel.
Less code = less bug and more time to focus on other stuff.
The pros and cons are already mentioned by previous answers so I will not repeat.

Fast Cross Platform Inter Process Communication in C++

I'm looking for a way to get two programs to efficiently transmit a large amount of data to each other, which needs to work on Linux and Windows, in C++. The context here is a P2P network program that acts as a node on the network and runs continuously, and other applications (which could be games hence the need for a fast solution) will use this to communicate with other nodes in the network. If there's a better solution for this I would be interested.
boost::asio is a cross platform library handling asynchronous io over sockets. You can combine this with using for instance Google Protocol Buffers for your actual messages.
Boost also provides you with boost::interprocess for interprocess communication on the same machine, but asio lets you do your communication asynchronously and you can easily have the same handlers for both local and remote connections.
I have been using ICE by ZeroC (www.zeroc.com), and it has been fantastic. Super easy to use, and it's not only cross platform, but has support for many languages as well (python, java, etc) and even an embedded version of the library.
Well, if we can assume the two processes are running on the same machine, then the fastest way for them to transfer large quantities of data back and forth is by keeping the data inside a shared memory region; with that setup, the data is never copied at all, since both processes can access it directly. (If you wanted to go even further, you could combine the two programs into one program, with each former 'process' now running as a thread inside the same process space instead. In that case they would be automatically sharing 100% of their memory with each other)
Of course, just having a shared memory area isn't sufficient in most cases: you would also need some sort of synchronization mechanism so that the processes can read and update the shared data safely, without tripping over each other. The way I would do that would be to create two double-ended queues in the shared memory region (one for each process to send with). Either use a lockless FIFO-queue class, or give each double-ended queue a semaphore/mutex that you can use to serialize pushing data items into the queue and popping data items out of the queue. (Note that the data items you'd be putting into the queues would only be pointers to the actual data buffers, not the data itself... otherwise you'd be back to copying large amounts of data around, which you want to avoid. It's a good idea to use shared_ptrs instead of plain C pointers, so that "old" data will be automatically freed when the receiving process is done using it). Once you have that, the only other thing you'd need is a way for process A to notify process B when it has just put an item into the queue for B to receive (and vice versa)... I typically do that by writing a byte into a pipe that the other process is select()-ing on, to cause the other process to wake up and check its queue, but there are other ways to do it as well.
This is a hard problem.
The bottleneck is the internet, and that your clients might be on NAT.
If you are not talking internet, or if you explicitly don't have clients behind carrier grade evil NATs, you need to say.
Because it boils down to: use TCP. Suck it up.
I would strongly suggest Protocol Buffers on top of TCP or UDP sockets.
So, while the other answers cover part of the problem (socket libraries), they're not telling you about the NAT issue. Rather than have your users tinker with their routers, it's better to use some techniques that should get you through a vaguely sane router with no extra configuration. You need to use all of these to get the best compatibility.
First, ICE library here is a NAT traversal technique that works with STUN and/or TURN servers out in the network. You may have to provide some infrastructure for this to work, although there are some public STUN servers.
Second, use both UPnP and NAT-PMP. One library here, for example.
Third, use IPv6. Teredo, which is one way of running IPv6 over IPv4, often works when none of the above do, and who knows, your users may have working IPv6 by some other means. Very little code to implement this, and increasingly important. I find about half of Bittorrent data arrives over IPv6, for example.