Fastest technique to pass messages between processes on Linux?


What is the fastest technology to send messages between C++ application processes, on Linux? I am vaguely aware that the following techniques are on the table:
TCP
UDP
Sockets
Pipes
Named pipes
Memory-mapped files
Are there any other ways, and which is the fastest?

Whilst all the above answers are very good, I think we'd have to discuss what "fastest" means [and does it have to be "fastest" or just "fast enough"?]
For LARGE messages, there is no doubt that shared memory is a very good technique, and very useful in many ways.
However, if the messages are small, there are drawbacks of having to come up with your own message-passing protocol and method of informing the other process that there is a message.
Pipes and named pipes are much easier to use in this case - they behave pretty much like a file, you just write data at the sending side, and read the data at the receiving side. If the sender writes something, the receiver side automatically wakes up. If the pipe is full, the sending side gets blocked. If there is no more data from the sender, the receiving side is automatically blocked. Which means that this can be implemented in fairly few lines of code with a pretty good guarantee that it will work at all times, every time.
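To make the "few lines of code" claim concrete, here is a minimal sketch of the blocking pipe behaviour described above, assuming a parent and child created with fork(). The function name send_through_pipe is made up for this illustration; only pipe(), fork(), read(), write() and waitpid() are real POSIX calls.

```cpp
#include <cassert>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: a child process writes `msg` into an anonymous pipe,
// the parent reads until end-of-file and returns what it received.
std::string send_through_pipe(const std::string& msg) {
    int fds[2];                        // fds[0] = read end, fds[1] = write end
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                    // child: the sender
        close(fds[0]);
        size_t off = 0;
        while (off < msg.size()) {     // write() blocks while the pipe is full
            ssize_t n = write(fds[1], msg.data() + off, msg.size() - off);
            if (n <= 0) _exit(1);
            off += static_cast<size_t>(n);
        }
        close(fds[1]);                 // closing signals end-of-file to the reader
        _exit(0);
    }

    close(fds[1]);                     // parent: the receiver
    std::string received;
    char buf[4096];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)  // blocks until data or EOF
        received.append(buf, static_cast<size_t>(n));
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    return received;
}
```

Calling send_through_pipe("hello") from main() should hand back the same string. Messages larger than the pipe buffer also work, because the writer simply blocks until the reader drains the pipe.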
Shared memory, on the other hand, relies on some other mechanism to tell the other process that "you have a packet of data to process". Yes, it's very fast if you have LARGE packets of data to copy, but I would be surprised if there is a huge difference compared to a pipe, really. The main benefit would be that the other side doesn't have to copy the data out of the shared memory, but it also relies on there being enough memory to hold all "in flight" messages, or on the sender being able to hold things back.
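As a rough illustration of the "some other mechanism" that shared memory needs, the sketch below uses the crudest one: the receiver spins on an atomic flag placed inside the shared region. It assumes an anonymous MAP_SHARED mapping inherited across fork() and assumes std::atomic<int> is lock-free on the target (true on mainstream Linux platforms). The Mailbox struct and the function name are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>
#include <new>
#include <string>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical one-slot mailbox living in shared memory.
struct Mailbox {
    std::atomic<int> ready;   // 0 = empty, 1 = message available
    char data[256];
};

// The child writes `msg` into the mailbox and raises the flag;
// the parent busy-waits on the flag and then reads the message.
std::string receive_via_shared_flag(const std::string& msg) {
    assert(msg.size() < sizeof(Mailbox::data));
    void* mem = mmap(nullptr, sizeof(Mailbox), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(mem != MAP_FAILED);
    Mailbox* box = new (mem) Mailbox;
    box->ready.store(0);

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                                    // child: the sender
        std::memcpy(box->data, msg.c_str(), msg.size() + 1);
        box->ready.store(1, std::memory_order_release);  // publish the data
        _exit(0);
    }

    // Parent: nothing wakes us up, so we have to poll the flag ourselves.
    while (box->ready.load(std::memory_order_acquire) == 0) {
        // spin
    }
    std::string out(box->data);
    waitpid(pid, nullptr, 0);
    munmap(mem, sizeof(Mailbox));
    return out;
}
```

The spin loop burns a CPU core while it waits, which is exactly the cost a pipe avoids by letting the kernel put the reader to sleep.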
I'm not saying "don't use shared memory", I'm just saying that there is no such thing as "one solution that solves all problems 'best'".
To clarify: I would start by implementing a simple method using a pipe or named pipe [depending on which suits the purposes], and measure the performance of that. If a significant time is spent actually copying the data, then I would consider using other methods.
Of course, another consideration should be "are we ever going to use two separate machines [or two virtual machines on the same system] to solve this problem?" In that case a network solution is a better choice, even if it's not THE fastest. I've run a local TCP stack on my machines at work for benchmark purposes and got some 20-30Gbit/s (roughly 2-3GB/s) of sustained traffic. A raw memcpy within the same process gets around 50-100Gbit/s (roughly 5-10GB/s) (unless the block size is REALLY tiny and fits in the L1 cache). I haven't measured a standard pipe, but I expect that's somewhere roughly in the middle of those two numbers. [These numbers are about right for a number of different medium-sized, fairly modern PCs; obviously, on an ARM, MIPS or other embedded-style controller, expect lower numbers for all of these methods.]
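If you want to reproduce the memcpy figure on your own hardware, a minimal timing sketch could look like the following. The function names are made up for this example, and the result depends heavily on buffer size, compiler flags and the machine, so treat it as a starting point for measurement rather than a benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Converts gigabits per second to gigabytes per second (8 bits per byte).
double gbit_to_gbyte(double gbit_per_s) { return gbit_per_s / 8.0; }

// Copies a buffer of `bytes` bytes `reps` times and returns the observed
// copy rate in GB/s (1 GB = 1e9 bytes).
double memcpy_gbytes_per_s(std::size_t bytes, int reps) {
    assert(bytes > 0 && reps > 0);
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    volatile char sink = 0;            // keeps the copies from being optimised away

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        src[0] = static_cast<char>(i); // make every iteration's input differ
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[bytes - 1];
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;

    double seconds = std::chrono::duration<double>(stop - start).count();
    return static_cast<double>(bytes) * reps / seconds / 1e9;
}
```

Try it with a block far larger than the last-level cache (say 256 MB) and again with a few kilobytes to see the cache effect mentioned above.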

I would suggest looking at this also: How to use shared memory with Linux in C.
Basically, I'd drop network protocols such as TCP and UDP when doing IPC on a single machine. These have packetization overhead and tie up additional resources (e.g. ports, the loopback interface).

NetOS Systems Research Group from Cambridge University, UK has done some (open-source) IPC benchmarks.
Source code is located at https://github.com/avsm/ipc-bench .
Project page: http://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-bench/ .
Results: http://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-bench/results.html
This research has been published using the results above: http://anil.recoil.org/papers/drafts/2012-usenix-ipc-draft1.pdf

Check CMA and kdbus:
https://lwn.net/Articles/466304/
I think the fastest stuff these days is based on AIO.
http://www.kegel.com/c10k.html

As you tagged this question with C++, I'd recommend Boost.Interprocess:
Shared memory is the fastest interprocess communication mechanism. The
operating system maps a memory segment in the address space of several
processes, so that several processes can read and write in that memory
segment without calling operating system functions. However, we need
some kind of synchronization between processes that read and write
shared memory.
Source
One caveat I've found is the portability limitations of the synchronization primitives. Neither OS X nor Windows has a native implementation of interprocess condition variables, for example, so Boost.Interprocess emulates them with spin locks.
Now if you use a *nix which supports POSIX process-shared primitives, there will be no problems.
Shared memory with synchronization is a good approach when considerable data is involved.
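For reference, this is roughly what a native POSIX process-shared primitive looks like without Boost: a pthread mutex marked PTHREAD_PROCESS_SHARED and placed in shared memory. This is a sketch, not what Boost.Interprocess does internally; the struct and function names are invented, and older toolchains need -pthread when linking.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical shared record: a mutex and the counter it protects.
struct SharedCounter {
    pthread_mutex_t lock;
    long value;
};

// Parent and child each add `increments_each` to one shared counter.
long count_from_two_processes(int increments_each) {
    void* mem = mmap(nullptr, sizeof(SharedCounter), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(mem != MAP_FAILED);
    SharedCounter* sc = static_cast<SharedCounter*>(mem);
    sc->value = 0;

    // The attribute below is what makes the mutex usable across processes.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&sc->lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pid_t pid = fork();
    assert(pid >= 0);
    for (int i = 0; i < increments_each; ++i) {   // runs in both processes
        pthread_mutex_lock(&sc->lock);
        ++sc->value;
        pthread_mutex_unlock(&sc->lock);
    }
    if (pid == 0) _exit(0);                       // child is done

    waitpid(pid, nullptr, 0);
    long total = sc->value;
    pthread_mutex_destroy(&sc->lock);
    munmap(mem, sizeof(SharedCounter));
    return total;
}
```

With the mutex in place the total is always twice increments_each; without it, updates from the two processes can be lost.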

Well, you could simply have a shared memory segment between your processes, using Linux shared memory, aka SHM.
It's quite easy to use, look at the link for some examples.
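A minimal sketch of the System V SHM calls, assuming a parent and child that share the segment through fork(). The function name is made up; unrelated processes would agree on a key (for example via ftok()) instead of using IPC_PRIVATE. Synchronisation here is deliberately crude: the reader just waits for the writer to exit.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: the child stores `msg` in a System V shared memory
// segment, the parent reads it back after the child has exited.
std::string pass_via_sysv_shm(const std::string& msg) {
    const size_t size = 4096;
    assert(msg.size() < size);

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);  // create the segment
    assert(id >= 0);
    void* mem = shmat(id, nullptr, 0);                      // map it into this process
    assert(mem != reinterpret_cast<void*>(-1));
    char* p = static_cast<char*>(mem);

    pid_t pid = fork();                 // the attachment is inherited by the child
    assert(pid >= 0);
    if (pid == 0) {
        std::memcpy(p, msg.c_str(), msg.size() + 1);
        _exit(0);
    }

    waitpid(pid, nullptr, 0);           // crude sync: wait until the writer is done
    std::string out(p);
    shmdt(mem);                         // unmap
    shmctl(id, IPC_RMID, nullptr);      // mark the segment for removal
    return out;
}
```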

POSIX message queues are pretty fast, but they have some limitations.

Related

Interaction of two c/c++ programs

I'm in complete lack of understanding in this. Maybe this is too broad for stack, but here it goes:
Suppose I have two programs (written in C/C++) running simultaneously, say A and B, with different PIDs.
What are the options to make them interact with each other? For instance, how do I pass information from one to the other, such as having one wait for a signal from the other and respond accordingly?
I know MPI, but MPI normally works for programs that are compiled using the same source (so, it works more for parallel computing than just interaction from completely different programs built to interact with each other).
Thanks

You should look up "IPC" (inter-process communication). There are several types:
pipes
signals
shared memory
message queues
semaphores
files (per suggestion of @JonathanLeffler :-)
RPC (suggested by @sftrabbit), which is usually more geared towards client/server
CORBA
D-Bus
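Since the question asks specifically about one program waiting for a signal from the other, here is a small sketch of the "signals" entry from the list above. It assumes a single-threaded parent and a forked child; the function name is invented. With two unrelated programs, B would need to learn A's PID some other way (a PID file, for instance).

```cpp
#include <cassert>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: the child sends SIGUSR1 to the parent, the parent
// sleeps until it arrives and returns the signal number it received.
int wait_for_signal_from_child() {
    sigset_t set, old_mask;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    // Block SIGUSR1 before forking so it stays pending instead of
    // killing us if the child is faster than we are.
    sigprocmask(SIG_BLOCK, &set, &old_mask);

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        kill(getppid(), SIGUSR1);      // notify the parent
        _exit(0);
    }

    int received = 0;
    int rc = sigwait(&set, &received); // sleeps until SIGUSR1 is pending
    assert(rc == 0);
    (void)rc;

    waitpid(pid, nullptr, 0);
    sigprocmask(SIG_SETMASK, &old_mask, nullptr);  // restore the old mask
    return received;
}
```

A signal carries almost no payload, so in practice it is used as a "go look" notification next to one of the other mechanisms in the list.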

You use one of the many interprocess communication mechanisms, like pipes (one application writes bytes into a pipe, the other reads from it; imagine stdin/stdout) or shared memory (a region of memory is mapped into both programs' virtual address spaces and they can communicate through it).

The same source doesn't matter - once your programs are compiled the system doesn't know or care where they came from.
There are different ways to communicate between them depending on how much data, how fast, one-way or bidirectional, predictable rate, etc.
The simplest is possibly just to use the network - note that if both programs are on the same machine, the traffic goes over the loopback interface and never touches the network hardware, so it is much faster than a real network link.

Sharing data locally (like with sockets) between multiple programs in C++

My goal is to send/share data between multiple programs. These are the options I thought of:
I could use a file, but prefer to use my RAM because it's generally faster.
I could use a socket, but that would require a lot of address information which is unnecessary for local stuff. And ports too.
I could ask others about an efficient way to do this.
I chose the last one.
So, what would be an efficient way to send data from one program to another? It might use a buffer, for example: write bytes to it, wait for the receiver to mark the first byte as 'read' (basically anything other than the byte written), then write again. But where would I put the buffer, and how would I make it accessible to both programs? Or perhaps something else might work too?
I use linux.

What about FIFOs and pipes? If you are in a Linux environment, this is the way to allow two programs to share data.
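A FIFO (named pipe) needs no address or port, only a path both programs agree on. The sketch below assumes a temporary directory under /tmp and a made-up function name; in real use the two sides would be separate programs opening the same well-known path.

```cpp
#include <cassert>
#include <cstdlib>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: creates a FIFO, lets a child write `msg` into it,
// and returns what the parent read from the other end.
std::string send_through_fifo(const std::string& msg) {
    char dir[] = "/tmp/fifo_demo_XXXXXX";
    char* made = mkdtemp(dir);                   // private scratch directory
    assert(made != nullptr);
    (void)made;
    std::string path = std::string(dir) + "/queue";
    int rc = mkfifo(path.c_str(), 0600);
    assert(rc == 0);
    (void)rc;

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                              // child: the writer
        int wfd = open(path.c_str(), O_WRONLY);  // blocks until a reader opens
        if (wfd < 0) _exit(1);
        if (write(wfd, msg.data(), msg.size()) < 0) _exit(1);
        close(wfd);
        _exit(0);
    }

    int rfd = open(path.c_str(), O_RDONLY);      // blocks until a writer opens
    assert(rfd >= 0);
    std::string received;
    char buf[4096];
    ssize_t n;
    while ((n = read(rfd, buf, sizeof buf)) > 0) // read until the writer closes
        received.append(buf, static_cast<size_t>(n));
    close(rfd);
    waitpid(pid, nullptr, 0);
    unlink(path.c_str());
    rmdir(dir);
    return received;
}
```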

The fastest IPC for processes running on the same host is shared memory.
In short, several processes can access the same memory segment.
See this tutorial.

You may want to take a look at Boost.Interprocess
Boost.Interprocess simplifies the use of common interprocess communication and synchronization mechanisms and offers a wide range of them:
Shared memory.
Memory-mapped files.
Semaphores, mutexes, condition variables and upgradable mutex types to place them in shared
memory and memory mapped files.
Named versions of those synchronization objects, similar to UNIX/Windows sem_open/CreateSemaphore API.
File locking.
Relative pointers.
Message queues.

To answer your questions:
Using a file is probably not the best way, and files are usually not used for passing inter-process information. Remember that the OS has to open, read, write and close them. They are, however, used for locking (http://en.wikipedia.org/wiki/File_locking).
You get the highest performance using pipe streams (http://linux.die.net/man/3/popen), but on Linux it's hard to get right. You have to redirect stdin, stdout and stderr, and this has to be done for each process involved. So it will work well for two applications, but go beyond that and it gets very hairy.
My favorite solution: use socketpairs (http://pubs.opengroup.org/onlinepubs/009604499/functions/socketpair.html). These are very robust and easy to set up. But if you use multiple applications you have to prepare some sort of pool through which to reach the applications.
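A socketpair gives you a bidirectional channel with no addresses or ports at all. The sketch below is one possible use, assuming a forked child acting as a tiny "server" that upper-cases whatever it receives. It uses SOCK_DGRAM so each send() arrives as one message; the function name and the upper-casing are purely illustrative.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: sends `request` to a child over a socketpair and
// returns the child's reply (the request converted to upper case).
std::string request_reply(const std::string& request) {
    assert(!request.empty() && request.size() <= 1024);
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
    assert(rc == 0);
    (void)rc;

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                               // child: the "server"
        close(sv[0]);
        char buf[1024];
        ssize_t n = recv(sv[1], buf, sizeof buf, 0);
        if (n < 0) _exit(1);
        for (ssize_t i = 0; i < n; ++i)
            buf[i] = static_cast<char>(
                std::toupper(static_cast<unsigned char>(buf[i])));
        if (send(sv[1], buf, static_cast<size_t>(n), 0) < 0) _exit(1);
        _exit(0);
    }

    close(sv[1]);                                 // parent: the "client"
    ssize_t sent = send(sv[0], request.data(), request.size(), 0);
    assert(sent == static_cast<ssize_t>(request.size()));
    (void)sent;
    char buf[1024];
    ssize_t n = recv(sv[0], buf, sizeof buf, 0);  // same socket, other direction
    assert(n >= 0);
    close(sv[0]);
    waitpid(pid, nullptr, 0);
    return std::string(buf, static_cast<size_t>(n));
}
```

Both directions run over the same descriptor, which is what makes this simpler than juggling two one-way pipes.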

On Linux, when using files, they are very often in cache, so you won't read the disk that often, and you could use a "RAM" filesystem like tmpfs (actually tmpfs use virtual memory, so RAM + swap, and practically the files are kept in RAM most of the time).
The main issue remains synchronization.
Using sockets (which may be, if all processes are on the same machine, AF_UNIX sockets, which are faster than TCP/IP ones) has the advantage of making your code easily portable to environments where you prefer to run several processes on several machines.
And you could also use an existing framework for parallel execution, like e.g. MPI, Corba, etc etc.
You should have a rough idea of the bandwidth and latency expected from your application
(it is not the same if you need to share dozens of megabytes every millisecond, or hundreds of kilobytes every tenth of a second).
I would suggest learning more about serialization techniques, formats and libraries like XDR, ASN1, JSON, YAML, s11n, jsoncpp etc.
And sending data is not the same as sharing it. When you send (and receive) data, you think in terms of message passing. When you share data, you think in terms of shared memory. The programming style is very different.
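When you go the message-passing route over a byte stream (pipe, FIFO or stream socket), the first piece of serialization you need is a way to find message boundaries. Below is a hand-rolled sketch of length-prefixed framing with a 4-byte big-endian header. It is not XDR or any of the libraries named above, and the function names frame and unframe are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Prepends a 4-byte big-endian length to `payload`.
std::string frame(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);
    std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += payload;
    return out;
}

// Tries to extract one complete message from the front of `buffer`.
// On success, removes it from `buffer`, stores it in `payload` and returns
// true. Returns false if the buffer does not yet hold a whole message.
bool unframe(std::string& buffer, std::string& payload) {
    if (buffer.size() < 4) return false;           // header incomplete
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n = (n << 8) | static_cast<unsigned char>(buffer[i]);
    if (buffer.size() - 4 < n) return false;       // body incomplete
    payload = buffer.substr(4, n);
    buffer.erase(0, 4 + static_cast<std::size_t>(n));
    return true;
}
```

The receiver appends whatever read() returns to a buffer and calls unframe in a loop until it returns false, so partial reads and several messages arriving in one read are both handled.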

Shared memory is the best for sharing data between processes. But it needs a lot of synchronization, and if more than two processes share the data then synchronization is like a Cyclops (single eye, single shared memory).
If you use sockets (multicast sockets), the implementation will be a little more difficult, but scalability and maintainability become very easy. You don't need to care how many apps are waiting for the data; you just multicast it and they will listen for the data and process it. There is no need to wait for a semaphore (the shared memory synchronization technique) before reading the data.
So the time spent reading the data can be reduced.
Shared memory - Wait for the semaphore, read the data and process the data.
Sockets - Receive the data, process the data.
Performance, scalability and maintainability will be added advantages with the sockets.
Regards,
SSuman185

performance penalty of message passing as opposed to shared data

There is a lot of buzz these days about not using locks and using Message passing approaches like Erlang. Or about using immutable datastructures like in Functional programming vs. C++/Java.
But what I am concerned with is the following:
AFAIK, Erlang does not guarantee Message delivery. Messages might be lost. Won't the algorithm and code bloat and be complicated again if you have to worry about loss of messages? Whatever distributed algorithm you use must not depend on guaranteed delivery of messages.
What if the Message is a complicated object? Isn't there a huge performance penalty in copying and sending the messages vs. say keeping it in a shared location (like a DB that both processes can access)?
Can you really totally do away with shared states? I don't think so. For e.g. in a DB, you have to access and modify the same record. You cannot use message passing there. You need to have locking or assume Optimistic concurrency control mechanisms and then do rollbacks on errors. How does Mnesia work?
Also, it is not the case that you always need to worry about concurrency. Any project will also have a large piece of code that doesn't have to do anything with concurrency or transactions at all (but they do have performance and speed as a concern). A lot of these algorithms depend on shared states (that's why pass-by-reference or pointers are so useful).
Given this fact, writing programs in Erlang etc. is a pain because you are prevented from doing any of these things. Maybe it makes programs robust, but for things like solving a linear programming problem or computing the convex hull, performance is more important, and forcing immutability etc. on the algorithm when it has nothing to do with concurrency/transactions is a poor decision. Isn't it?

That's real life : you need to account for this possibility regardless of the language / platform. In a distributed world (the real world), things fail: live with it.
Of course there is a cost: nothing is free in our universe. But shouldn't you use another medium (e.g. file, db) instead of shuttling "big objects" in communication pipes? You can always use "message" to refer to "big objects" stored somewhere.
Of course not: the idea behind functional programming / Erlang OTP is to "isolate" as much as possible the areas where "shared state" is manipulated. Furthermore, having clearly marked places where shared state is mutated helps testability & traceability.
I believe you are missing the point: there is no such thing as a silver bullet. If your application cannot be successfully built using Erlang then don't do it. You can always build some other part of the overall system in another fashion, i.e. use a different language / platform. Erlang is no different from any other language in this respect: use the right tool for the right job.
Remember: Erlang was designed to help solve concurrent, asynchronous and distributed problems. It isn't optimized for working efficiently on a shared block of memory for example... unless you count interfacing with nif functions working on shared blocks part of the game :-)

Real-world systems are always hybrids anyway: I don't believe the modern paradigms try, in practice, to get rid of mutable data and shared state.
The objective, however, is not to need concurrent access to this shared state. Programs can be divided into the concurrent and the sequential, and use message-passing and the new paradigms for the concurrent parts.
Not every code will get the same investment: There is concern that threads are fundamentally "considered harmful". Something like Apache may need traditional concurrent threads and a key piece of technology like that may be carefully refined over a period of years so it can blast away with fully concurrent shared state. Operating system kernels are another example where "solve the problem no matter how expensive it is" may make sense.
There is no benefit to fast-but-broken: But for new code, or code that doesn't get so much attention, it may be the case that it simply isn't thread-safe, and it will not handle true concurrency, and so the relative "efficiency" is irrelevant. One way works, and one way doesn't.
Don't forget testability: Also, what value can you place on testing? Thread-based shared-memory concurrency is simply not testable. Message-passing concurrency is. So now you have the situation where you can test one paradigm but not the other. So, what is the value in knowing that the code has been tested? The danger in not even knowing if the other code will work in every situation?

A few comments on the misunderstanding you have of Erlang:
Erlang guarantees that messages will not be lost, and that they will arrive in the order sent. A basic error situation is that machine A cannot speak to machine B. When that happens, process monitors and links will trigger, and system node-down messages will be sent to the processes that registered for them. Nothing will be silently dropped. Processes will "crash" and supervisors (if any) will try to restart them.
Objects can not be mutated, so they are always copied. One way to secure immutability is by copying values to other erlang process' heaps. Another way is to allocate objects in a shared heap, message references to them and simply not have any operations that mutate them. Erlang does the first for performance! Realtime suffers if you need to stop all processes to garbage collect a shared heap. Ask Java.
There is shared state in Erlang. Erlang is not proud of it, but it is pragmatic about it. One example is the local process registry which is a global map that maps a name to a process so that system processes can be restarted and claim their old name. Erlang just tries to avoid shared state if it possibly can. ETS tables that are public are another example.
Yes, sometimes Erlang is too slow. This happens in all languages. Sometimes Java is too slow. Sometimes C++ is too slow. Just because a tight loop in a game had to drop down to assembly to kick off some serious SIMD-based vector mathematics, you can't deduce that everything should be written in assembly because it is the only language that is fast when it matters. What matters is being able to write systems that have good performance, and Erlang manages quite well. See benchmarks on yaws or rabbitmq.
Your facts are not facts about Erlang. Even if you think Erlang programming is a pain, you will find other people create some awesome software thanks to it. You should attempt writing an IRC server in Erlang, or something else very concurrent. Even if you're never going to use Erlang again, you would have learned to think about concurrency another way. But of course, you will, because Erlang is awesomely easy.
Those that do not understand Erlang are doomed to re-implement it badly.
Okay, the original was about Lisp, but... it's true!

There are some implicit assumptions in your questions - you assume that all the data can fit
on one machine and that the application is intrinsically localised to one place.
What happens if the application is so large it cannot fit on one machine? What happens if the application outgrows one machine?
You don't want to have one way to program an application if it fits on one machine and
a completely different way of programming it as soon as it outgrows one machine.
What happens if you want to make a fault-tolerant application? To make something fault-tolerant you need at least two physically separated machines and no sharing.
When you talk about sharing and databases you omit to mention that things like MySQL
Cluster achieve fault-tolerance precisely by maintaining synchronised copies of the
data in physically separated machines - there is a lot of message passing and
copying that you don't see on the surface - Erlang just exposes this.
The way you program should not suddenly change to accommodate fault-tolerance and scalability.
Erlang was designed primarily for building fault-tolerant applications.
Shared data on a multi-core has its own set of problems - when you access shared data
you need to acquire a lock - if you use a global lock (the easiest approach) you can end up
stopping all the cores while you access the shared data. Shared data access on a multicore
can be problematic due to caching problems, if the cores have local data caches then accessing "far away" data (in some other processors cache) can be very expensive.
Many problems are intrinsically distributed and the data is never available in one place
at the same time - so these kinds of problems fit well with the Erlang way of thinking.
In a distributed setting "guaranteeing message delivery" is impossible - the destination machine might have crashed. Erlang thus cannot guarantee message delivery -
it takes a different approach - the system will tell you if it failed to deliver a message
(but only if you have used the link mechanism) - then you can write your own custom error
recovery.
For pure number crunching Erlang is not appropriate - but in a hybrid system Erlang
is good at managing how computations get distributed to available processors, so we see a lot of systems where Erlang manages the distribution and fault-tolerant aspects of the problem, but the problem itself is solved in a different language.

For e.g. in a DB, you have to access and modify the same record
But that is handled by the DB. As a user of the database, you simply execute your query, and the database ensures it is executed in isolation.
As for performance, one of the most important things about eliminating shared state is that it enables new optimizations. Shared state is not particularly efficient. You get cores fighting over the same cache lines, and data has to be written through to memory where it could otherwise stay in a register or in CPU cache.
Many compiler optimizations rely on absence of side effects and shared state as well.
You could say that a stricter language guaranteeing these things requires more optimizations to be performant than something like C, but it also makes these optimizations much much easier for the compiler to implement.
Many concerns similar to concurrency issues arise in single-threaded code. Modern CPUs are pipelined, execute instructions out of order, and can run 3-4 of them per cycle. So even in a single-threaded program, it is vital that the compiler and CPU are able to determine which instructions can be interleaved and executed in parallel.

For correctness, shared is the way to go, and keep the data as normalized as possible. For immediacy, send messages to inform of changes, but always back them up with polling. Messages get dropped, duplicated, re-ordered, delayed - don't rely on them.
If speed is what you're worried about, first do it single-thread and tune the daylights out of it. Then if you've got multiple cores and know how to split up the work, use parallelism.

Erlang provides supervisors and gen_server callbacks for synchronous calls, so you will know about it if a message isn't delivered: either the gen_server call returns a timeout, or your whole node will be brought down and up if the supervisor is triggered.
Usually, if the processes are on the same node, message-passing languages optimise away the data copying, so it's almost like shared memory, except when the object is changed and used by both afterwards, which cannot be done using shared memory either anyway.
There is some state which is kept by processes by passing it around to themselves in the recursive tail-calls, also some state can be of course passed through messages. I don't use mnesia much, but it is a transactional database, so once you have passed the operation to mnesia (and it has returned) you are pretty much guaranteed it will go through..
Which is why it is easy to tie such applications into erlang with the use of ports or drivers. The easiest are the ports, it's much like a unix pipe, though I think performance isn't that great...and as said, message-passing usually ends up just being pointer passing anyways as the VM/compiler optimise the memory copy out.

Communication between processes

I'm looking for some data to help me decide which would be the better/faster for communication between two independent processes on Linux:
TCP
Named Pipes
Which is worse: the system overhead for the pipes or the tcp stack overhead?
Updated exact requirements:
only local IPC needed
will mostly be a lot of short messages
no cross-platform needed, only Linux

In the past I've used local domain sockets for that sort of thing. My library determined whether the other process was local to the system or remote and used TCP/IP for remote communication and local domain sockets for local communication. The nice thing about this technique is that local/remote connections are transparent to the rest of the application.
Local domain sockets use the same mechanism as pipes for communication and don't have the TCP/IP stack overhead.
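A rough sketch of the local domain socket side of such a library, assuming a stream socket bound to a path in a temporary directory. The function name and the path are invented; switching to TCP would mean replacing AF_UNIX and sockaddr_un with AF_INET and sockaddr_in while the send/recv code stays the same.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: the parent listens on a Unix domain socket, a child
// connects and sends `msg`, and the parent returns what it received.
std::string send_over_unix_socket(const std::string& msg) {
    char dir[] = "/tmp/uds_demo_XXXXXX";
    char* made = mkdtemp(dir);
    assert(made != nullptr);
    (void)made;
    std::string path = std::string(dir) + "/sock";

    sockaddr_un addr;
    std::memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;                     // a path instead of IP + port
    std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);

    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    assert(listener >= 0);
    int rc = bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
    assert(rc == 0);
    rc = listen(listener, 1);
    assert(rc == 0);
    (void)rc;

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                                // child: the client
        close(listener);
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) _exit(1);
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) != 0)
            _exit(1);
        if (send(fd, msg.data(), msg.size(), 0) < 0) _exit(1);
        close(fd);
        _exit(0);
    }

    int conn = accept(listener, nullptr, nullptr); // parent: the server
    assert(conn >= 0);
    std::string received;
    char buf[4096];
    ssize_t n;
    while ((n = recv(conn, buf, sizeof buf, 0)) > 0)
        received.append(buf, static_cast<size_t>(n));
    close(conn);
    close(listener);
    waitpid(pid, nullptr, 0);
    unlink(path.c_str());                          // remove the socket file
    rmdir(dir);
    return received;
}
```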

I don't really think you should worry about the overhead (which will be ridiculously low). Did you make sure using profiling tools that the bottleneck of your application is likely to be TCP overhead?
Anyways as Carl Smotricz said, I would go with sockets because it will be really trivial to separate the applications in the future.

I discussed this in an answer to a previous post. I had to compare socket, pipe, and shared memory communications. Pipes were definitely faster than sockets (maybe by a factor of 2 if I recall correctly ... I can check those numbers when I return to work). But those measurements were just for the pure communication. If the communication is a very small part of the overall work, then the difference will be negligible between the two types of communication.
Edit
Here are some numbers from the test I did a few years ago. Your mileage may vary (particularly if I made stupid programming errors). In this specific test, a "client" and "server" on the same machine echoed 100 bytes of data back and forth. It made 10,000 requests. In the document I wrote up, I did not indicate the specs of the machine, so it is only the relative speeds that may be of any value. But for the curious, the times given here are the average cost per request:
TCP/IP: .067 ms
Pipe with I/O Completion Ports: .042 ms
Pipe with Overlapped I/O: .033 ms
Shared Memory with Named Semaphore: .011 ms

There will be more overhead using TCP - that will involve breaking the data up into packets, calculating checksums and handling acknowledgement, none of which is necessary when communicating between two processes on the same machine. Using a pipe will just copy the data into and out of a buffer.

I don't know if this suits you, but a very common way of doing IPC (interprocess communication) under Linux is shared memory. It's actually ultra fast (I haven't profiled this, but it is just shared data in RAM with strong processing around it).
The main problem with this approach is the semaphore: you must build a little system around it to make sure one process is not writing at the same time as the other one is trying to read.
A very simple starter tutorial is here
This is not as portable as using sockets, but the concept would be the same, so if you're migrating this to Windows, you will just have to change the shared memory create/attach layer.
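The "little system" around the semaphore can be quite small. The sketch below pairs a shared integer with a System V semaphore so the reader sleeps until the writer has finished. It assumes a forked child and an anonymous shared mapping; the SemArg union name and the function name are made up (Linux expects the caller to define the union passed to semctl()).

```cpp
#include <cassert>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

// Argument type for semctl(); the name SemArg is invented for this sketch.
union SemArg {
    int val;
    struct semid_ds* buf;
    unsigned short* array;
};

// The child publishes `value` in shared memory and posts the semaphore;
// the parent blocks on the semaphore and only then reads the value.
int read_after_writer_signals(int value) {
    void* mem = mmap(nullptr, sizeof(int), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(mem != MAP_FAILED);
    int* shared = static_cast<int*>(mem);
    *shared = 0;

    int sem = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    assert(sem >= 0);
    SemArg arg;
    arg.val = 0;                          // 0 means "data not ready yet"
    int rc = semctl(sem, 0, SETVAL, arg);
    assert(rc == 0);

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                       // child: the writer
        *shared = value;
        struct sembuf post;
        post.sem_num = 0;
        post.sem_op = 1;                  // increment: "data is ready"
        post.sem_flg = 0;
        semop(sem, &post, 1);
        _exit(0);
    }

    struct sembuf wait_op;
    wait_op.sem_num = 0;
    wait_op.sem_op = -1;                  // decrement: sleeps while the count is 0
    wait_op.sem_flg = 0;
    rc = semop(sem, &wait_op, 1);
    assert(rc == 0);
    (void)rc;
    int seen = *shared;                   // safe: the writer has finished

    waitpid(pid, nullptr, 0);
    semctl(sem, 0, IPC_RMID);             // remove the semaphore set
    munmap(mem, sizeof(int));
    return seen;
}
```

Unlike spinning on a flag, the reader here uses no CPU while it waits, at the cost of a system call on each side.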

Two things to consider:
Connection setup cost
Continuous Communication cost
On TCP:
(1) more costly - the 3-way handshake overhead required for a (potentially) unreliable channel.
(2) more costly - IP level overhead (checksum etc.), TCP overhead (sequence number, acknowledgement, checksum etc.) pretty much all of which aren't necessary on the same machine because the channel is supposed to be reliable and not introduce network related impairments (e.g. packet reordering).
But I would still go with TCP provided it makes sense (i.e. depends on the situation) because of its ubiquity (read: easy cross-platform support) and the overhead shouldn't be a problem in most cases (read: profile, don't do premature optimization).
Updated: if cross-platform support isn't required and the emphasis is on performance, then go with named pipes or domain sockets, as I am pretty sure the platform developers will have optimized out the unnecessary functionality needed for handling network-level impairments.

A Unix domain socket is a very good compromise. It doesn't have the overhead of TCP, but it is more extensible than the pipe solution. A point you did not consider is that sockets are bidirectional, while named pipes are unidirectional.

I think the pipes will be a little lighter, but I'm just guessing.
But since pipes are a local thing, there's probably a lot less complicated code involved.
Other people might tell you to try and measure both to find out. It's hard to go wrong with this answer, but you may not be willing to invest the time. That would leave you hoping my guess is correct ;)

Best way for interprocess communication in C++

I have two processes; one will query the other for data. There will be a huge number of queries in a limited time (10000 per second), and more than 100 MB of data will be transferred per second. The data will be of built-in numeric types (double, int).
My question is: how should I connect these processes?
Shared memory, message queue, LPC (Local Procedure Call) or others...
Also, which library would you suggest? By the way, please do not suggest MPI.
Edit: under Windows XP 32-bit.

One word: Boost.Interprocess. If it really needs to be fast, shared memory is the way to go. You have nearly zero overhead, as the operating system does the usual mapping between virtual and physical addresses and no copy is required for the data. You just have to look out for concurrency issues.
For actually sending commands like shutdown and query, I would use message queues. I previously used localhost network programming to do that, and used manual shared memory allocation, before I knew about Boost. Damn, if I needed to rewrite the app, I would immediately pick Boost. Boost.Interprocess makes this easier for you. Check it out.

I would use shared memory to store the data, and message queues to send the queries.

I'll second Marc's suggestion -- I'd not bother with boost unless you have a portability concern or want to do cool stuff like map standard container types over shared memory (in which case I'd definitely use boost).
Otherwise, message queues and shared memory are pretty simple to deal with.

If your data consists of multiple types and/or you need things like mutex, use Boost.
Else use a shared section of memory using #pragma data_seg or a memory mapped file.

If you do use shared memory you will have to decide whether or not to spin. I'd expect that if you use a semaphore for synchronization and store data in shared memory, you will not get much performance benefit compared to using message queues (at a significant cost in clarity), but if you spin on an atomic variable for synchronization, then you have to suffer the consequences of that.
