I have a complete lack of understanding of this. Maybe this is too broad for Stack Overflow, but here it goes:
Suppose I have two programs (written in C/C++) running simultaneously, say A and B, with different PIDs.
What are the options to make them interact with each other? For instance, how do I pass information from one to the other, such as having one wait for a signal from the other and respond accordingly?
I know MPI, but MPI normally works for programs compiled from the same source (so it is geared more towards parallel computing than towards interaction between completely different programs built to talk to each other).
Thanks
You should look into "IPC" (inter-process communication). There are several types:
pipes (see the FIFO sketch after this list)
signals
shared memory
message queues
semaphores
files (per suggestion of @JonathanLeffler :-)
RPC (suggested by @sftrabbit), which is usually more geared towards client/server
CORBA
D-Bus
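As a minimal sketch of the pipe option from the list (the FIFO sketch referenced next to "pipes"): the path /tmp/demo_fifo is just an example name and error handling is omitted.

    #include <sys/stat.h>   // mkfifo
    #include <fcntl.h>      // open
    #include <unistd.h>     // write, close
    #include <cstring>      // strlen

    int main() {
        const char* path = "/tmp/demo_fifo";   // example path, pick your own
        mkfifo(path, 0666);                    // create the named pipe (ok if it already exists)
        int fd = open(path, O_WRONLY);         // blocks until some reader opens the FIFO
        const char* msg = "hello from A\n";
        write(fd, msg, std::strlen(msg));      // the reader wakes up as soon as data arrives
        close(fd);
        return 0;
    }

The other program opens the same path with O_RDONLY and read()s; both sides block until their peer shows up, which gives you exactly the wait-for-the-other-process behaviour the question asks about.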
You use one of the many interprocess communication mechanisms, like pipes (one application writes bytes into a pipe, the other reads from it; think of stdin/stdout) or shared memory (a region of memory is mapped into both programs' virtual address spaces and they can communicate through it).
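As a hedged sketch of the shared-memory variant, POSIX shared memory (shm_open + mmap) between two unrelated processes might look like this; the name "/demo_shm" is an example and error handling is omitted.

    #include <sys/mman.h>   // shm_open, mmap
    #include <fcntl.h>      // O_CREAT, O_RDWR
    #include <unistd.h>     // ftruncate, close
    #include <cstring>      // strcpy

    int main() {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0666);   // example name
        ftruncate(fd, 4096);                                       // size the segment
        void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        std::strcpy(static_cast<char*>(p), "hello from A");
        // Process B would shm_open("/demo_shm", O_RDWR, 0666) and mmap it the same way.
        // On older glibc you may need to link with -lrt.
        munmap(p, 4096);
        close(fd);
        return 0;
    }

You still need some synchronization (a POSIX semaphore, a flag word, a pipe used only for wake-ups) so the reader knows when the bytes are valid.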
The same source doesn't matter - once your programs are compiled the system doesn't know or care where they came from.
There are different ways to communicate between them, depending on how much data, how fast, one-way or bidirectional, predictable rate, etc.
The simplest is possibly just to use the network. Note that if you are on the same machine, the network stack will automatically use some higher-performance path to actually send the data (i.e. shared memory).
Related
I have an application A, and I want to share some information with an application B.
Application A writes information every ~150 ms.
Application B reads the information at arbitrary times.
I searched and found QSharedMemory, which looks great, but application B will not be developed by my company, so I can't choose the programming language.
Is QSharedMemory a good idea?
How can I do that?
QSharedMemory is a thin wrapper around named and unnamed platform shared memory. When named, there's simply a file that the other application can memory-map and use from any programming language, as long as said language supports binary buffers.
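For reference, a hedged sketch of the Qt writer side; the key name and payload are placeholders, and you still need some convention so B knows when the data is consistent.

    #include <QCoreApplication>
    #include <QSharedMemory>
    #include <cstring>

    int main(int argc, char** argv) {
        QCoreApplication app(argc, argv);
        QSharedMemory shm("my_shared_key");            // example key
        if (!shm.create(4096))                         // create + attach, 4096 bytes
            shm.attach();                              // already exists: just attach
        const char* payload = "value=42";              // placeholder data
        shm.lock();                                    // built-in cross-process lock
        std::memcpy(shm.data(), payload, std::strlen(payload) + 1);
        shm.unlock();
        return 0;                                      // detaches in the destructor
    }

If the non-Qt side needs a predictable platform object name to open, QSharedMemory::setNativeKey() lets you control the underlying name instead of Qt's derived one.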
I do wonder if it wouldn't be easier, though, if you used a pipe for IPC. QLocalSocket encapsulates that on Qt's end, and the other side simply uses a native pipe.
Shared memory makes sense only in certain scenarios, like, say, pushing images that may not change all that much between applications - where the cost of pushing the entire image all the time would be prohibitive in the light of small-on-average bandwidth of changes. The image doesn't need to mean a visual image, it may be an industrial process image, etc.
In many cases, shared memory is a premature pseudo-optimization that makes things much harder than necessary, and can, in case of a multitude of communicating processes, become a pessimization - you do pay the cost in virtual memory for each shared memory segment.
Sounds like you need to implement a simple server; using local sockets, it should be pretty fast in terms of bandwidth and easy to develop. The server will act to store data from A and deliver it to B upon request.
Obviously, it won't work "with no application" in between. Whether you go for shared memory or a local socket, you will need some server code running at all times to service A and B. If A is running all the time, it can well be a part of it, but it can also be standalone.
It would be preferable to use a local socket, because the API for that is more portable across different programming languages, in that case A and B can be implemented in arbitrary languages and frameworks and communicate at the socket protocol level. With QSharedMemory it won't be as portable in your scenario.
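As an illustration of the local-socket approach, a minimal sketch of a plain AF_UNIX server; the socket path is just an example and error handling is omitted. Any language on the B side can connect to the same path.

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstring>

    int main() {
        const char* path = "/tmp/demo.sock";                 // example path
        unlink(path);                                        // drop a stale socket file
        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
        listen(srv, 1);
        int client = accept(srv, nullptr, nullptr);          // B connects to the same path
        const char* latest = "value=123\n";                  // whatever A last produced
        write(client, latest, std::strlen(latest));
        close(client);
        close(srv);
        return 0;
    }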
I have a basic question regarding MPI, to get a better understanding of it (I am new to MPI and multiple processes, so please bear with me on this one). I am using a simulation environment in C++ (RepastHPC) that makes extensive use of MPI (using the Boost libraries) to allow parallel operations. In particular, the simulation consists of multiple instances of the respective classes (i.e. agents) that are supposed to interact with each other, exchange information, etc. Now, given that this takes place on multiple processes (and given my rudimentary understanding of MPI), the natural question or fear I have is that agents on different processes no longer interact with each other because they cannot connect (I know, this contradicts the entire idea of MPI).
After reading the manual, my understanding is this: the available libraries of Boost.MPI (and also the libraries of the above-mentioned package) take care of all of the communication and of sending packages back and forth between processes, i.e. each process has copies of the instances from other processes (I guess this is some form of call by value, because the original instance cannot be changed from a process that has only a copy); then an update takes place to ensure that the copies of the instances have the same information as the originals, and so on.
Does this mean that, in terms of the final outcomes of the simulation runs, I get the same as if I were doing the entire thing on one process? Put differently, are the multiple processes just supposed to speed things up without changing the design of the simulation (so I don't have to worry about it)?
I think you have a fundamental misunderstanding of MPI here. MPI is not an automatic parallelization library. It isn't a distributed shared memory mechanism. It doesn't do any magic for you.
What it does do is make it simpler to communicate between different processes on the same or different machines. Each process has its own address space which does not overlap with the other processes (unless you're doing something else outside of MPI). Assuming you set up your MPI installation correctly, it will do all of the pain of setting up the communication channels between your processes for you. It also gives you some higher level abstractions like collective communication.
When you use MPI, you compile your code differently than normal. Instead of using g++ -o code code.cpp (or whatever your compiler is), you use mpicxx -o code code.cpp. This will automatically link in all of the MPI pieces necessary. Then, when you run your application, you use mpiexec -n <num_processes> ./code (other arguments may be needed depending on your setup). The argument num_processes tells MPI how many processes to launch. This isn't done at compile/link time.
You will also have to rewrite your code to use MPI. MPI has lots of functions (the standard is available here and there are lots of tutorials available on the web that are easier to understand) that you can use. The basics are MPI_Send() and MPI_Recv(), but there's lots and lots more. You'll have to find a tutorial for that.
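For a first taste, here is a minimal sketch of the MPI_Send()/MPI_Recv() basics mentioned above (compile with mpicxx, run with mpiexec -n 2 ./code; the payload is just a placeholder).

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            int value = 42;                                  // placeholder payload
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value = 0;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }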
What is the fastest technology to send messages between C++ application processes, on Linux? I am vaguely aware that the following techniques are on the table:
TCP
UDP
Sockets
Pipes
Named pipes
Memory-mapped files
Are there any more ways, and which is the fastest?
Whilst all the above answers are very good, I think we'd have to discuss what is "fastest" [and does it have to be "fastest", or just "fast enough"?].
For LARGE messages, there is no doubt that shared memory is a very good technique, and very useful in many ways.
However, if the messages are small, there are drawbacks of having to come up with your own message-passing protocol and method of informing the other process that there is a message.
Pipes and named pipes are much easier to use in this case: they behave pretty much like a file, you just write data at the sending side and read the data at the receiving side. If the sender writes something, the receiver side automatically wakes up. If the pipe is full, the sending side gets blocked. If there is no more data from the sender, the receiving side is automatically blocked. This means that it can be implemented in fairly few lines of code with a pretty good guarantee that it will work at all times, every time.
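To make the "few lines of code" claim concrete, here is a rough sketch with an anonymous pipe between a parent and a child; a named pipe created with mkfifo behaves the same way between unrelated processes.

    #include <unistd.h>     // pipe, fork, read, write
    #include <sys/wait.h>   // wait
    #include <cstdio>
    #include <cstring>

    int main() {
        int fds[2];
        pipe(fds);                                  // fds[0] = read end, fds[1] = write end
        if (fork() == 0) {                          // child: the receiver
            close(fds[1]);
            char buf[64] = {0};
            read(fds[0], buf, sizeof(buf) - 1);     // blocks until the parent writes
            std::printf("child got: %s\n", buf);
            return 0;
        }
        close(fds[0]);                              // parent: the sender
        const char* msg = "ping";
        write(fds[1], msg, std::strlen(msg) + 1);
        close(fds[1]);
        wait(nullptr);
        return 0;
    }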
Shared memory on the other hand relies on some other mechanism to inform the other thread that "you have a packet of data to process". Yes, it's very fast if you have LARGE packets of data to copy - but I would be surprised if there is a huge difference to a pipe, really. Main benefit would be that the other side doesn't have to copy the data out of the shared memory - but it also relies on there being enough memory to hold all "in flight" messages, or the sender having the ability to hold back things.
I'm not saying "don't use shared memory", I'm just saying that there is no such thing as "one solution that solves all problems 'best'".
To clarify: I would start by implementing a simple method using a pipe or named pipe [depending on which suits the purposes], and measure the performance of that. If a significant time is spent actually copying the data, then I would consider using other methods.
Of course, another consideration should be "are we ever going to use two separate machines [or two virtual machines on the same system] to solve this problem?" In that case, a network solution is a better choice, even if it's not THE fastest. I've run a local TCP stack on my machines at work for benchmark purposes and got some 20-30 Gbit/s (2-3 GB/s) with sustained traffic. A raw memcpy within the same process gets around 50-100 GBit/s (5-10 GB/s) (unless the block size is REALLY tiny and fits in the L1 cache). I haven't measured a standard pipe, but I expect that's somewhere roughly in the middle of those two numbers. [These numbers are about right for a number of different medium-sized, fairly modern PCs; obviously, on an ARM, MIPS, or other embedded-style controller, expect lower numbers for all of these methods.]
I would suggest looking at this also: How to use shared memory with Linux in C.
Basically, I'd drop network protocols such as TCP and UDP when doing IPC on a single machine. These have packetization overhead and are bound to even more resources (e.g. ports, the loopback interface).
NetOS Systems Research Group from Cambridge University, UK has done some (open-source) IPC benchmarks.
Source code is located at https://github.com/avsm/ipc-bench .
Project page: http://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-bench/ .
Results: http://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-bench/results.html
This research has been published using the results above: http://anil.recoil.org/papers/drafts/2012-usenix-ipc-draft1.pdf
Check CMA and kdbus:
https://lwn.net/Articles/466304/
I think the fastest stuff these days is based on AIO.
http://www.kegel.com/c10k.html
As you tagged this question with C++, I'd recommend Boost.Interprocess:
Shared memory is the fastest interprocess communication mechanism. The operating system maps a memory segment in the address space of several processes, so that several processes can read and write in that memory segment without calling operating system functions. However, we need some kind of synchronization between processes that read and write shared memory.
Source
One caveat I've found is the portability limitation of the synchronization primitives. Neither OS X nor Windows has a native implementation of interprocess condition variables, for example, so Boost emulates them with spin locks.
If you use a *nix that supports POSIX process-shared primitives, there will be no problems.
Shared memory with synchronization is a good approach when considerable data is involved.
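For completeness, a hedged sketch of the writer side with Boost.Interprocess; the segment name "demo_shm" and the size are placeholders, and real code still needs the synchronization the quote mentions.

    #include <boost/interprocess/shared_memory_object.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstring>

    int main() {
        using namespace boost::interprocess;
        shared_memory_object::remove("demo_shm");            // clean up a stale segment
        shared_memory_object shm(open_or_create, "demo_shm", read_write);
        shm.truncate(4096);                                  // size the segment
        mapped_region region(shm, read_write);               // map it into this process
        std::strcpy(static_cast<char*>(region.get_address()), "hello from A");
        // The reader opens shared_memory_object(open_only, "demo_shm", read_only)
        // and maps a mapped_region(shm, read_only) over it.
        return 0;
    }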
Well, you could simply have a shared memory segment between your processes, using Linux shared memory, a.k.a. SHM.
It's quite easy to use; look at the link for some examples.
POSIX message queues are pretty fast, but they have some limitations.
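For reference, a minimal sketch of the POSIX message queue API; the queue name is an example, you link with -lrt on Linux, and the default limits live under /proc/sys/fs/mqueue.

    #include <mqueue.h>
    #include <fcntl.h>      // O_CREAT, O_WRONLY
    #include <cstring>

    int main() {
        mq_attr attr{};
        attr.mq_maxmsg = 10;            // queue depth (bounded by system limits)
        attr.mq_msgsize = 128;          // max bytes per message
        mqd_t q = mq_open("/demo_queue", O_CREAT | O_WRONLY, 0644, &attr);
        const char* msg = "hello";
        mq_send(q, msg, std::strlen(msg) + 1, 0);   // last argument is the priority
        mq_close(q);
        // The receiver mq_open()s "/demo_queue" with O_RDONLY and calls mq_receive()
        // with a buffer of at least mq_msgsize bytes.
        return 0;
    }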
My goal is to send/share data between multiple programs. These are the options I thought of:
I could use a file, but prefer to use my RAM because it's generally faster.
I could use a socket, but that would require a lot of address information which is unnecessary for local stuff. And ports too.
I could ask others about an efficient way to do this.
I chose the last one.
So, what would be an efficient way to send data from one program to another? It might use a buffer, for example: write bytes to it and wait for the receiver to mark the first byte as 'read' (basically anything other than the byte written), then write again. But where would I put the buffer, and how would I make it accessible to both programs? Or perhaps something else might work too?
I use Linux.
What about FIFOs and pipes? If you are in a Linux environment, this is the way to allow two programs to share data.
The fastest IPC for processes running on the same host is shared memory.
In short, several processes can access the same memory segment.
See this tutorial.
You may want to take a look at Boost.Interprocess
Boost.Interprocess simplifies the use of common interprocess communication and synchronization mechanisms and offers a wide range of them:
Shared memory.
Memory-mapped files.
Semaphores, mutexes, condition variables and upgradable mutex types to place them in shared memory and memory-mapped files.
Named versions of those synchronization objects, similar to UNIX/Windows sem_open/CreateSemaphore API.
File locking.
Relative pointers.
Message queues.
To answer your questions:
Using a file is probably not the best way, and files are usually not used for passing inter-process information. Remember the OS has to open, read, write, and close them. They are, however, used for locking (http://en.wikipedia.org/wiki/File_locking).
The highest performance you get using popen-style pipe streams (http://linux.die.net/man/3/popen), but on Linux this is hard to get right: you have to redirect stdin, stdout, and stderr, and this has to be done for each pair of processes. So it will work well for two applications, but go beyond that and it gets very hairy.
My favorite solution: use socketpairs (http://pubs.opengroup.org/onlinepubs/009604499/functions/socketpair.html). These are very robust and easy to set up. But if you use multiple applications, you have to prepare some sort of pool through which to access the applications.
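Here is a minimal socketpair() sketch, under the assumption that the two communicating processes are related by fork() (the usual socketpair scenario); it also shows the bidirectional use that plain pipes lack.

    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);        // sv[0] and sv[1] are connected both ways
        if (fork() == 0) {                              // child keeps sv[1]
            close(sv[0]);
            char buf[64] = {0};
            read(sv[1], buf, sizeof(buf) - 1);          // wait for the parent's request
            const char* reply = "pong";
            write(sv[1], reply, std::strlen(reply) + 1);// answer on the same descriptor
            return 0;
        }
        close(sv[1]);                                   // parent keeps sv[0]
        const char* req = "ping";
        write(sv[0], req, std::strlen(req) + 1);
        char buf[64] = {0};
        read(sv[0], buf, sizeof(buf) - 1);
        std::printf("parent got: %s\n", buf);
        close(sv[0]);
        wait(nullptr);
        return 0;
    }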
On Linux, when using files, they are very often in the cache, so you won't read the disk that often, and you could use a "RAM" filesystem like tmpfs (actually tmpfs uses virtual memory, so RAM + swap, and in practice the files are kept in RAM most of the time).
The main issue remains synchronization.
Using sockets (which may be, if all processes are on the same machine, AF_UNIX sockets, which are faster than TCP/IP ones) has the advantage of making your code easily portable to environments where you prefer to run several processes on several machines.
And you could also use an existing framework for parallel execution, like e.g. MPI, Corba, etc etc.
You should have a rough idea of the bandwidth and latency your application needs
(it is not the same if you need to share dozens of megabytes every millisecond as hundreds of kilobytes every tenth of a second).
I would suggest learning more about serialization techniques, formats and libraries like XDR, ASN1, JSON, YAML, s11n, jsoncpp etc.
And sending and sharing data are not the same thing. When you send (and receive) data, you think in terms of message passing. When you share data, you think in terms of shared memory. The programming styles are very different.
Shared memory is the best option for sharing data between processes, but it needs lots of synchronization, and if more than two processes are sharing the data then synchronization is like a Cyclops (single eye, single shared memory).
But if you make use of sockets (multicast sockets), the implementation will be a little more difficult, but scalability and maintainability are very easy. You don't need to care how many apps will be waiting for the data; you can just multicast, and they will listen for the data and process it. There is no need to wait on a semaphore (the shared-memory synchronization technique) to read the data.
So reading the data time can be reduced.
Shared memory - Wait for the semaphore, read the data and process the data.
Sockets - Receive the data, process the data.
Performance, scalability and maintainability are added advantages with sockets.
Regards,
SSuman185
My Unix/Windows C++ app is already parallelized using MPI: the job is split across N CPUs and each chunk is executed in parallel, quite efficiently, with very good speed scaling; the job is done right.
But some of the data is repeated in each process, and for technical reasons this data cannot easily be split over MPI (...).
For example:
5 Gb of static data, exact same thing loaded for each process
4 Gb of data that can be distributed over MPI; the more CPUs are used, the smaller this per-CPU RAM is.
On a 4-CPU job, this would mean at least a 20 Gb RAM load, with most of the memory 'wasted'; this is awful.
I'm thinking using shared memory to reduce the overall load, the "static" chunk would be loaded only once per computer.
So, main question is:
Is there any standard MPI way to share memory on a node? Some kind of readily available, free library?
If not, I would use boost.interprocess and use MPI calls to distribute local shared memory identifiers.
The shared memory would be read by a "local master" on each node and shared read-only. There is no need for any kind of semaphore/synchronization, because it won't change.
Any performance hit or particular issues to be wary of?
(There won't be any "strings" or overly weird data structures; everything can be brought down to arrays and structure pointers.)
The job will be executed in a PBS (or SGE) queuing system; in the case of an unclean process exit, I wonder if those will clean up the node-specific shared memory.
One increasingly common approach in High Performance Computing (HPC) is hybrid MPI/OpenMP programs. I.e. you have N MPI processes, and each MPI process has M threads. This approach maps well to clusters consisting of shared memory multiprocessor nodes.
Changing to such a hierarchical parallelization scheme obviously requires some more or less invasive changes, OTOH if done properly it can increase the performance and scalability of the code in addition to reducing memory consumption for replicated data.
Depending on the MPI implementation, you may or may not be able to make MPI calls from all threads. This is specified by the required and provided arguments to the MPI_Init_thread() function, which you must call instead of MPI_Init(). The possible values are:
MPI_THREAD_SINGLE: only one thread will execute.
MPI_THREAD_FUNNELED: the process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are "funneled" to the main thread).
MPI_THREAD_SERIALIZED: the process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are "serialized").
MPI_THREAD_MULTIPLE: multiple threads may call MPI, with no restrictions.
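For illustration, a minimal sketch of requesting a threading level; the calls are standard MPI, but whether MPI_THREAD_MULTIPLE is actually provided depends on the implementation, as noted below.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided = 0;
        // Ask for the most permissive level; the library reports what it can actually give.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            std::printf("only threading level %d provided\n", provided);
        MPI_Finalize();
        return 0;
    }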
In my experience, modern MPI implementations like Open MPI support the most flexible MPI_THREAD_MULTIPLE. If you use older MPI libraries, or some specialized architecture, you might be worse off.
Of course, you don't need to do your threading with OpenMP, that's just the most popular option in HPC. You could use e.g. the Boost threads library, the Intel TBB library, or straight pthreads or windows threads for that matter.
I haven't worked with MPI, but if it's like other IPC libraries I've seen that hide whether other threads/processes/whatever are on the same or different machines, then it won't be able to guarantee shared memory. Yes, it could handle shared memory between two nodes on the same machine, if that machine provided shared memory itself. But trying to share memory between nodes on different machines would be very difficult at best, due to the complex coherency issues raised. I'd expect it to simply be unimplemented.
In all practicality, if you need to share memory between nodes, your best bet is to do that outside MPI. I don't think you need to use boost.interprocess-style shared memory, since you aren't describing a situation where the different nodes are making fine-grained changes to the shared memory; it's either read-only or partitioned.
John's and deus's answers cover how to map in a file, which is definitely what you want to do for the 5 Gb (gigabit?) static data. The per-CPU data sounds like the same thing, and you just need to send a message to each node telling it what part of the file it should grab. The OS should take care of mapping virtual memory to physical memory to the files.
As for cleanup... I would assume it doesn't do any cleanup of shared memory, but mmapped files should be cleaned up, since files are closed (which should release their memory mappings) when a process is torn down. I have no idea what caveats CreateFileMapping etc. have.
Actual "shared memory" (i.e. boost.interprocess) is not cleaned up when a process dies. If possible, I'd recommend trying killing a process and seeing what is left behind.
With MPI-2 you have RMA (remote memory access) via functions such as MPI_Put and MPI_Get. Using these features, if your MPI installation supports them, would certainly help you reduce the total memory consumption of your program. The cost is added complexity in coding but that's part of the fun of parallel programming. Then again, it does keep you in the domain of MPI.
MPI-3 offers shared memory windows (see e.g. MPI_Win_allocate_shared()), which allows usage of on-node shared memory without any additional dependencies.
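A hedged sketch of that MPI-3 route; the sizes are placeholders, and the MPI_Comm_split_type step is there so the window only spans ranks on the same node.

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // Group the ranks that live on the same node.
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int node_rank = 0;
        MPI_Comm_rank(node_comm, &node_rank);

        // Only the node-local rank 0 allocates the block; the others allocate 0 bytes
        // but still get access through the window. (Size shrunk for the sketch.)
        MPI_Aint size = (node_rank == 0) ? 1024 * 1024 : 0;
        double* base = nullptr;
        MPI_Win win;
        MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                                node_comm, &base, &win);

        // Non-zero ranks ask where rank 0's block lives in their own address space.
        MPI_Aint qsize = 0;
        int disp = 0;
        double* shared = base;
        MPI_Win_shared_query(win, 0, &qsize, &disp, &shared);

        // ... rank 0 fills 'shared', everyone else reads it after a barrier ...
        MPI_Barrier(node_comm);

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }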
I don't know much about unix, and I don't know what MPI is. But in Windows, what you are describing is an exact match for a file mapping object.
If this data is embedded in your .EXE or a .DLL that it loads, then it will automatically be shared between all processes. Teardown of your process, even as a result of a crash, will not cause any leaks or unreleased locks of your data. However, a 9Gb .dll sounds a bit iffy, so this probably doesn't work for you.
However, you could put your data into a file, then CreateFileMapping and MapViewOfFile on it. The mapping can be read-only, and you can map all or part of the file into memory. All processes will share pages that are mapped from the same underlying CreateFileMapping object. It's good practice to unmap views and close handles, but if you don't, the OS will do it for you on teardown.
Note that unless you are running x64, you won't be able to map a 5Gb file into a single view (or even a 2Gb file, 1Gb might work). But given that you are talking about having this already working, I'm guessing that you are already x64 only.
If you store your static data in a file, you can use mmap on unix to get random access to the data. Data will be paged in as and when you need access to a particular bit of the data. All that you will need to do is overlay any binary structures over the file data. This is the unix equivalent of CreateFileMapping and MapViewOfFile mentioned above.
Incidentally glibc uses mmap when one calls malloc to request more than a page of data.
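A minimal mmap sketch for the read-only static data described above; the file name is a placeholder, and every process that maps the same file read-only shares the underlying physical pages via the page cache.

    #include <sys/mman.h>   // mmap, munmap
    #include <sys/stat.h>   // fstat
    #include <fcntl.h>      // open
    #include <unistd.h>     // close

    int main() {
        int fd = open("static_data.bin", O_RDONLY);     // placeholder file name
        struct stat st;
        fstat(fd, &st);
        // Read-only MAP_SHARED mapping: every process mapping this file shares the pages.
        void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        // ... overlay your arrays/structs over 'data' and use them directly ...
        munmap(data, st.st_size);
        close(fd);
        return 0;
    }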
I had some projects with MPI in SHUT.
As far as I know, there are many ways to distribute a problem using MPI; maybe you can find another solution that does not require shared memory.
My project solved a system of 7,000,000 equations in 7,000,000 variables.
If you can explain your problem, I will try to help you.
I ran into this problem in the small when I used MPI a few years ago.
I am not certain that SGE understands memory-mapped files. If you are distributing across a Beowulf cluster, I suspect you're going to have coherency issues. Could you say a little about your multiprocessor architecture?
My draft approach would be to set up an architecture where each part of the data is owned by a defined CPU. There would be two threads: one thread being an MPI two-way talker and one thread for computing the result. Note that MPI and threads don't always play well together.