Can Boost ASIO be used to build low-latency applications? - c++

Can Boost ASIO be used to build low-latency applications, such as HFT (High Frequency Trading)?
Boost.Asio selects the optimal platform-specific demultiplexing mechanism: IOCP, epoll, kqueue, poll_set, /dev/poll.
An Ethernet adapter with a TOE (TCP/IP offload engine) can also be used, as can OpenOnload (kernel-bypass BSD sockets).
But can a low-latency application be built using Boost.Asio + TOE + OpenOnload?

This is the advice from the Asio author, posted to the public SG-14 Google Group (which unfortunately is having issues, and they have moved to another mailing list system):
I do work on ultra low latency financial markets systems. Like many
in the industry, I am unable to divulge project specifics. However, I
will attempt to answer your question.
In general:
At the lowest latencies you will find hardware based solutions.
Then: Vendor-specific kernel bypass APIs. For example where you encode and decode frames, or use a (partial) TCP/IP stack
implementation that does not follow the BSD socket API model.
And then: Vendor-supplied drop-in (i.e. LD_PRELOAD) kernel bypass libraries, which re-implement the BSD socket API in a way that is
transparent to the application.
Asio works very well with drop-in kernel bypass libraries. Using
these, Asio-based applications can implement standard financial
markets protocols, handle multiple concurrent connections, and expect
median 1/2 round trip latencies of ~2 usec, low jitter and high
message rates.
My advice to those using Asio for low latency work can be summarised
as: "Spin, pin, and drop-in".
Spin: Don't sleep. Don't context switch. Use io_service::poll()
instead of io_service::run(). Prefer single-threaded scheduling.
Disable locking and thread support. Disable power management. Disable
C-states. Disable interrupt coalescing.
Pin: Assign CPU affinity. Assign interrupt affinity. Assign memory to
NUMA nodes. Consider the physical location of NICs. Isolate cores from
general OS use. Use a system with a single physical CPU.
Drop-in: Choose NIC vendors based on performance and availability of
drop-in kernel bypass libraries. Use the kernel bypass library.
This advice is decoupled from the specific protocol implementation
being used. Thus, as a Beast user you could apply these techniques
right now, and if you did you would have an HTTP implementation with
~10 usec latency (N.B. number plucked from air, no actual benchmarking
performed). Of course, a specific protocol implementation should still
pay attention to things that may affect latency, such as encoding and
decoding efficiency, memory allocations, and so on.
As far as the low latency space is concerned, the main things missing
from Asio and the Networking TS are:
Batching datagram syscalls (i.e. sendmmsg, recvmmsg).
Certain socket options.
These are not included because they are (at present) OS-specific and
not part of POSIX. However, Asio and the Networking TS do provide an
escape hatch, in the form of the native_*() functions and the
"extensible" type requirements.
Cheers, Chris
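For illustration only (not part of Chris's post), here is a minimal sketch of the "Spin" part of that advice, assuming a recent Boost.Asio where io_context has replaced io_service:

#include <boost/asio.hpp>

// Build with -DBOOST_ASIO_DISABLE_THREADS if nothing else in the program
// needs Asio's thread support.
int main() {
    // A concurrency hint of 1 tells Asio it may elide internal locking.
    boost::asio::io_context io(1);

    // ... set up sockets and async operations against `io` here ...

    for (;;) {
        io.poll();  // run whatever handlers are ready; never block or sleep
        // optionally check a stop flag, service timers, etc.
    }
}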

I evaluated Boost Asio for use in high frequency trading a few years ago. To the best of my knowledge the basics are still the same today. Here are some reasons why I decided not to use it:
Asio relies on bind() style callbacks. There is some overhead here.
It is not obvious how to arrange certain low-level operations to occur at the right moment or in the right way.
There is rather a lot of complex code in an area which is important to optimize. It is harder to optimize complex, general code for specific use cases. Thinking that you will not need to look under the covers would be a mistake.
There is little to no need for portability in HFT applications. In particular, having "automatic" selection of a multiplexing mechanism is contrary to the mission, because each mechanism must be tested and optimized separately--this creates more work rather than reducing it.
If a third-party library is to be used, others such as libev, libevent, and libuv are more battle-hardened and avoid some of these downsides.
Related: C++ Socket Server - Unable to saturate CPU


How can I distinguish between high- and low-performance cores/threads in C++?

When talking about multi-threading, it often seems like threads are treated as equal - just the same as the main thread, but running next to it.
On some newer processors, however, such as the Apple "M" series and the upcoming Intel Alder Lake series, not all threads are equally performant: these chips feature separate high-performance cores and slower, high-efficiency cores.
That's not to say that there weren't already things such as hyper-threading, but this seems to have a much larger performance implication.
Is there a way to query std::thread's properties and control which cores they'll run on in C++?
Please understand that "thread" is an abstraction of the hardware's capabilities and that something beyond your control (the OS, the kernel's scheduler) is responsible for creating and managing this abstraction. "Importance" and performance hints are part of that abstraction (typically presented in the form of a thread priority).
Any attempt to break the "thread" abstraction (e.g. determine if the core is a low-performance or high-performance core) is misguided. For example, the OS could move your thread to a low-performance core immediately after you find out that you were running on a high-performance one, leading you to assume that you're on a high-performance core when you are not.
Even pinning your thread to a specific core (in the hope that it'll always be using a high-performance core) can and will backfire: it can cause you to get less work done, because you've prevented yourself from using a "faster than nothing" low-performance core while the high-performance cores are busy doing other work.
The biggest problem is that C++ creates a worse abstraction (std::thread) on top of the "likely better" abstraction provided by the OS. Specifically, there's no way to set, modify or obtain the thread priority using std::thread; so you're left without any control over the "performance hints" that are necessary (for the OS, scheduler) to make good "load vs. performance vs. power management" decisions.
When talking about multi-threading, it often seems like threads are treated as equal
Often people think we're still using time-sharing systems from the 1960s. Stop listening to these fools. Modern systems do not allow CPU time to be wasted on unimportant work while more important work waits. Effective use of thread priorities is a fundamental performance requirement. Everything else ("load vs. performance vs. power management" decisions) is, by necessity, beyond your control (on the other side of the "thread" abstraction you're using).
Is there any way to query std::thread‘s properties and enforce on which cores they’ll run in C++?
No. There is no standard API for this in C++.
Platform-specific APIs do have the ability to specify a specific logical core (or a set of such cores) for a software thread. For example, GNU has pthread_setaffinity_np.
Note that this allows you to specify "core 1" for your thread, but that doesn't necessarily get you a "performance" core unless you know which core that is. To figure that out, you may need to go below OS level and into CPU-specific assembly programming. In the case of Intel, to my understanding, you would use the Enhanced Hardware Feedback Interface.
No, the C++ standard library has no direct way to query the sub-type of CPU, or state you want a thread to run on a specific CPU.
But std::thread (and jthread) does have .native_handle(), which on most platforms will let you do this.
If you know the threading library implementation of your std::thread, you can use native_handle() to get at the underlying primitives, then use the underlying threading library to do this kind of low-level work.
This will be completely non-portable, of course.
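Putting the two answers above together, a hedged sketch for Linux with GNU libc (the choice of logical core 1 is purely illustrative and says nothing about whether it is a "performance" core):

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <cstdio>

int main() {
    std::thread worker([] { /* ... work ... */ });

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);  // request logical core 1

    // native_handle() exposes the underlying pthread_t on glibc systems.
    int rc = pthread_setaffinity_np(worker.native_handle(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

    worker.join();
}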
iPhones, iPads, and newer Macs have high- and low-performance cores for a reason. The low-performance cores allow some reasonable amount of work to be done while using the smallest possible amount of energy, making the battery of the device last longer. These additional cores are not there just for fun; if you try to get around them, you can end up with a much worse experience for the user.
If you use the C++ standard library for running multiple threads, the operating system will detect what you are doing, and act accordingly. If your task only takes 10ms on a high-performance core, it will be moved to a low-performance core; it's fast enough and saves battery life. If you have multiple threads using 100% of the CPU time, the high-performance cores will be used automatically (plus the low-performance cores as well). If your battery runs low, the device can switch to all low-performance cores which will get more work done with the battery charge you have.
You should really think about what you want to do. You should put the needs of the user ahead of your perceived needs. Apart from that, Apple recommends assigning OS-specific priorities to your threads, which improves behaviour if you do it right. Giving a thread the highest priority so you can get better benchmark results is usually not "doing it right".
You can't select the core that a thread will be physically scheduled to run on using std::thread. See here for more. I'd suggest using a framework like OpenMP or MPI, or you will have to dig into the native macOS APIs to select the core for your thread to execute on.
macOS provides a notion of "Quality of Service" for tasks, task queues and run loops, and threads. If you use libdispatch/GCD then the queue priorities map to the QoS as well. This article describes the QoS system in detail.
Using the macOS pthreads interface you can set a thread QoS before creating a thread, query a thread's QoS, or temporarily override a thread's QoS level (not visible in the query function though) using the non-portable functions in pthread/qos.h
This system by no means offers guarantees about how your threads will be scheduled, but can be used to make a hint to the scheduler.
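As a hedged sketch (macOS only), tagging the calling thread with a QoS class via the non-portable pthread/qos.h interface mentioned above looks roughly like this; it is a hint to the scheduler, not a guarantee of which core type will be used:

#include <pthread/qos.h>

void prefer_performance_cores() {
    // The second argument is a relative priority within the class (0 = default).
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
}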
I'm not aware of any way to get a similar interface on other systems, but that doesn't mean one doesn't exist. I imagine they'll become more widely discussed as these hybrid CPUs become more common.
EDIT: Intel provides information here about how to query this information for their hybrid processors on Windows and for the current CPU using cpuid, haven't had a chance to play with this though.

How to use ZeroMQ for multiple Server-Client pairs?

I'm implementing a performance heavy two-party protocol in C++14 utilising multithreading and am currently using ZeroMQ as a network layer.
The application has the following simple architecture:
One main server-role,
One main client-role,
Both server and client spawn a fixed number n of threads
All n parallel, concurrent thread-pairs execute a performance- and communication-heavy mutual, but exclusive, protocol exchange, i.e. these run in n fixed pairs and should not mix or interchange any data except with their fixed counterpart.
My current design uses a single ZeroMQ Context()-instance on both server and client, shared between all n local threads; each client/server thread-pair creates a ZMQ_PAIR socket (I just increment the port number) on the local, shared context for communication.
My question
Is there is a smarter or more efficient way of doing this?
i.e.: is there a natural way of using ROUTERS and DEALERS that might increase performance?
I do not have much experience with socket programming, and with my approach the number of sockets scales directly with n (the number of client-server thread-pairs). This might go up to a couple of thousand, and I'm unsure whether this is a problem or not.
I have control of both the server and client machines and source code and I have no outer restrictions that I need to worry about. All I care about is performance.
I've looked through all the patterns here, but I cannot find anyone that matches the case where the client-server pairs are fixed, i.e. I cannot use load-balancing and such.
Happy man!
ZeroMQ is a lovely and powerful tool for highly scaleable, low-overheads, Formal Communication ( behavioural, yes emulating some sort of the peers mutual behaviour "One Asks, the other Replies" et al ) Patterns.
Your pattern is quite simple, behaviourally-unrestricted and ZMQ_PAIR may serve well for this.
Performance
There ought to be some more details on the quantitative nature of this attribute:
a process-to-process latency [us]
a memory-footprint of a System-under-Test (SuT) architecture [MB]
a peak-amount of data-flow a SuT can handle [MB/s]
Performance Tips ( if quantitatively supported by observed performance data )
may increase I/O-performance by increasing Context( nIOthreads ) on instantiation (see the sketch after this list)
may fine-tune I/O-performance by hard-mapping an individual thread# -> Context.IO-thread#, which is helpful both for distributing workload and for keeping "separate" localhost IO-thread(s) free / ready for higher-priority signalling and other such needs
shall set up application-specific ToS-labelling of prioritised types of traffic, so as to allow advanced processing on the network layer along the route-segments between the client and server
if the memory footprint hurts (ZeroMQ is not zero-copy on TCP-protocol handling at the operating-system kernel level), one may try to move to a younger sister of ZeroMQ -- authored by Martin SUSTRIK, a co-father of ZeroMQ -- the POSIX-compliant nanomsg, with similar motivation and attractive performance figures. Worth knowing about, at least.
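A hedged sketch of the first tip, assuming the cppzmq binding, where Context( nIOthreads ) maps to the io_threads argument of zmq::context_t (the count of 2 is illustrative):

#include <zmq.hpp>

int main() {
    zmq::context_t ctx(2);  // two I/O threads instead of the default single one
    // ... create the per-pair sockets on ctx exactly as before ...
}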
Could ROUTER or DEALER increase an overall performance?
No, could not. Having in mind your stated architecture ( declared to be communication heavy ), other, even more sophisticated Scaleable Formal Communication Patterns behaviours that suit some other needs, do not add any performance benefit, but on the contrary, would cost you additional processing overheads without delivering any justifying improvement.
While your Formal Communication remains as defined, no additional bells and whistles are needed.
One point may be noted on ZMQ_PAIR archetype, some sources quote this to be rather an experimental archetype. If your gut sense feeling does not make you, besides SuT-testing observations, happy to live with this, do not mind a slight re-engineering step, that will keep you with all the freedom of un-pre-scribed Formal Communication Pattern behaviour, while having "non"-experimental pipes under the hood -- just replace the solo ZMQ_PAIR with a pair of ZMQ_PUSH + ZMQ_PULL and use messages with just a one-way ticket. Having the stated full-control of the SuT-design and implementation, this would be all within your competence.
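A hedged sketch of that re-engineering step, again assuming cppzmq; the endpoints and port numbers are illustrative only:

#include <zmq.hpp>

// Each side sends on a PUSH socket and receives on a PULL socket, i.e. two
// one-way pipes replace the single, experimental ZMQ_PAIR archetype.
void make_one_way_pipes(zmq::context_t &ctx) {
    zmq::socket_t rx(ctx, zmq::socket_type::pull);
    zmq::socket_t tx(ctx, zmq::socket_type::push);

    rx.bind("tcp://*:5557");            // inbound messages from the peer
    tx.connect("tcp://peer-host:5556"); // outbound messages to the peer
}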
How fast could I go?
There are some benchmark test records published for both the ZeroMQ and nanomsg performance / latency envelopes, for un-loaded network transports across traffic-free route-segments ( sure ).
If your SuT-design strives to go even faster -- say under some 800 ns end-to-end, there are other means to achieve this, but your design will have to follow other distributed computing strategy than a message-based data exchange and your project budget will have to adjust for additional expenditure for necessary ultra-low-latency hardware infrastructure.
It may surprise, but definitely well doable and pretty attractive for systems, where hundreds of nanoseconds are a must-have target within a Colocation Data Centre.

C/C++ technologies involved in sending data across networks very fast

In terms of low latency (I am thinking about financial exchanges/co-location- people who care about microseconds) what options are there for sending packets from a C++ program on two Unix computers?
I have heard about kernel-bypass network cards, but does this mean you program against some sort of API for the card? I presume this would be a faster option in comparison to using standard Unix Berkeley sockets?
I would really appreciate any contribution, especially from persons who are involved in this area.
EDITED from milliseconds to microseconds
EDITED I am kinda hoping to receive answers based more upon C/C++, rather than network hardware technologies. It was intended as a software question.
UDP sockets are fast, low latency, and reliable enough when both machines are on the same LAN.
TCP is much slower than UDP but when the two machines are not on the same LAN, UDP is not reliable.
Software profiling will stomp out the obvious problems with your program. However, when you are talking about network performance, network latency is likely to be your largest bottleneck. If you are using TCP, then you want to do things that avoid congestion and loss on your network, to prevent retransmissions. There are a few things to do to cope:
Use a network with bandwidth and reliability guarantees.
Properly size your TCP parameters to maximize utilization without incurring loss.
Use error correction in your data transmission to correct for the small amount of loss you might encounter.
Or you can avoid using TCP altogether. But if reliability is required, you will end up implementing much of what is already in TCP.
But, you can leverage existing projects that have already thought through a lot of these issues. The UDT project is one I am aware of, that seems to be gaining traction.
At some point in the past, I worked with a packet-sending driver that was loaded into the Windows kernel. Using this driver it was possible to generate a packet stream with something like 10-15 times the rate (I do not remember the exact number) achievable from an app using the sockets layer.
The advantage is simple: The sending request comes directly from the kernel and bypasses multiple layers of software: sockets, protocol (even for UDP packet simple protocol driver processing is still needed), context switch, etc.
Usually, reduced latency comes at the cost of reduced robustness. Compare, for example, the (often greatly advertised) fastpath option for ADSL: the reduced latency due to shorter packet transfer times comes at the cost of increased error susceptibility. Similar technologies might exist for a large number of network media, so it very much depends on the hardware technologies involved. Your question suggests you're referring to Ethernet, but it is unclear whether the link is Ethernet-only or something else (ATM, ADSL, …), and whether some other network technology would be an option as well. It also very much depends on geographical distances.
EDIT:
I got a bit carried away with the hardware aspects of this question. To provide at least one aspect tangible at the level of application design: have a look at zero-copy network operations like sendfile(2). They can be used to eliminate one possible cause of latency, although only in cases where the original data came from some source other than the application memory.
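For a concrete picture, a hedged sketch of the sendfile(2) call on Linux (the function name wrapping it is illustrative):

#include <sys/sendfile.h>
#include <sys/types.h>

// Send `count` bytes from `in_fd` (a regular file, at its current offset)
// to the connected socket `out_fd`. Returns bytes sent, or -1 on error.
ssize_t send_zero_copy(int out_fd, int in_fd, size_t count) {
    return sendfile(out_fd, in_fd, nullptr, count);
}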
As my day job, I work for a certain stock exchange. The answer below is my own opinion, based on the software solutions we provide for exactly this kind of high-throughput, low-latency data transfer. It is not intended in any way as a marketing pitch (please, I am a dev). This is just to outline the essential components of the software stack in this kind of solution for fast data (the data could be stock/trading market data or, in general, any data):
1] Physical layer - a network interface card in the case of a TCP/UDP/IP-based Ethernet network, or a very fast / high-bandwidth interface called an InfiniBand Host Channel Adapter. In the IP/Ethernet case, the software stack is part of the OS. For InfiniBand, the card manufacturers (Intel, Mellanox) provide their drivers, firmware and an API library against which one has to implement the socket code (even InfiniBand uses its own 'socket-ish' protocol for network communication between two nodes).
2] The next layer above the physical layer is a middleware which abstracts the lower network protocol nitty-gritty and provides some kind of interface for data I/O from the physical layer to the application layer. This layer also provides some kind of network data quality assurance (if using TCP).
3] The last layer would be an application which we provide on top of the middleware. Anyone who gets 1] and 2] from us can develop a low-latency / high-throughput 'data transfer over the network' kind of app for stock trading or algorithmic trading applications, using a choice of programming language interfaces - C, C++, Java, C#.
Basically, a client like you can develop their own application in C or C++ using the APIs we provide, which will take care of interacting with the NIC or HCA (i.e. the actual physical network interface) to send and receive data fast, really fast.
We have a comprehensive solution catering to the different quality and latency profiles demanded by our clients - some are fine with microsecond latency but need very high data quality / very few errors; some can tolerate a few errors but need nanosecond latency; some need microsecond latency with no errors tolerable; ...
If you need/or are interested in any way in this kind of solution , ping me offline at my contacts mentioned here at SO.

Communication between processes

I'm looking for some data to help me decide which would be the better/faster for communication between two independent processes on Linux:
TCP
Named Pipes
Which is worse: the system overhead for the pipes or the tcp stack overhead?
Updated exact requirements:
only local IPC needed
will mostly be a lot of short messages
no cross-platform needed, only Linux
In the past I've used local domain sockets for that sort of thing. My library determined whether the other process was local to the system or remote and used TCP/IP for remote communication and local domain sockets for local communication. The nice thing about this technique is that local/remote connections are transparent to the rest of the application.
Local domain sockets use the same mechanism as pipes for communication and don't have the TCP/IP stack overhead.
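A hedged sketch of the local-domain (AF_UNIX) client side on Linux; the socket path and error handling are illustrative only:

#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

int connect_local(const char *path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;  // ready for read()/write(), no TCP/IP stack involved
}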
I don't really think you should worry about the overhead (which will be ridiculously low). Did you make sure using profiling tools that the bottleneck of your application is likely to be TCP overhead?
Anyways as Carl Smotricz said, I would go with sockets because it will be really trivial to separate the applications in the future.
I discussed this in an answer to a previous post. I had to compare socket, pipe, and shared memory communications. Pipes were definitely faster than sockets (maybe by a factor of 2 if I recall correctly ... I can check those numbers when I return to work). But those measurements were just for the pure communication. If the communication is a very small part of the overall work, then the difference will be negligible between the two types of communication.
Edit
Here are some numbers from the test I did a few years ago. Your mileage may vary (particularly if I made stupid programming errors). In this specific test, a "client" and "server" on the same machine echoed 100 bytes of data back and forth. It made 10,000 requests. In the document I wrote up, I did not indicate the specs of the machine, so it is only the relative speeds that may be of any value. But for the curious, the times given here are the average cost per request:
TCP/IP: .067 ms
Pipe with I/O Completion Ports: .042 ms
Pipe with Overlapped I/O: .033 ms
Shared Memory with Named Semaphore: .011 ms
There will be more overhead using TCP - that will involve breaking the data up into packets, calculating checksums and handling acknowledgement, none of which is necessary when communicating between two processes on the same machine. Using a pipe will just copy the data into and out of a buffer.
I don't know if this suits you, but a very common way of doing IPC (inter-process communication) under Linux is by using shared memory. It's actually ultra fast (I haven't profiled this, but it's just shared data in RAM with some processing around it).
The main problem with this approach is synchronisation: you must build a little system around it (e.g. a semaphore) to make sure one process is not writing at the same time the other one is trying to read.
A very simple starter tutorial is here.
This is not as portable as using sockets, but the concept would be the same, so if you're migrating this to Windows, you will just have to change the shared memory create/attach layer.
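For reference, a hedged sketch of the POSIX flavour of this approach (shm_open plus mmap); the region name and permissions are illustrative, and synchronisation (e.g. a semaphore) is still up to you:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

void *map_shared_region(std::size_t size) {
    int fd = shm_open("/my_ipc_region", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return nullptr;

    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }

    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return (p == MAP_FAILED) ? nullptr : p;
}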
Two things to consider:
Connection setup cost
Continuous Communication cost
On TCP:
(1) more costly - 3way handshake overhead required for (potentially) unreliable channel.
(2) more costly - IP level overhead (checksum etc.), TCP overhead (sequence number, acknowledgement, checksum etc.) pretty much all of which aren't necessary on the same machine because the channel is supposed to be reliable and not introduce network related impairments (e.g. packet reordering).
But I would still go with TCP provided it makes sense (i.e. depends on the situation) because of its ubiquity (read: easy cross-platform support) and the overhead shouldn't be a problem in most cases (read: profile, don't do premature optimization).
Updated: if cross-platform support isn't required and the accent is on performance, then go with named/domain pipes, as I am pretty sure the platform developers will have optimized out the functionality that is needed only for handling network-level impairments.
A Unix domain socket is a very good compromise: it avoids the overhead of TCP but is more flexible than the pipe solution. A point you did not consider is that sockets are bidirectional, while named pipes are unidirectional.
I think the pipes will be a little lighter, but I'm just guessing.
But since pipes are a local thing, there's probably a lot less complicated code involved.
Other people might tell you to try and measure both to find out. It's hard to go wrong with this answer, but you may not be willing to invest the time. That would leave you hoping my guess is correct ;)

MPI or Sockets?

I'm working on a loosely coupled cluster for some data processing. The network code and processing code is in place, but we are evaluating different methodologies in our approach. Right now, as we should be, we are I/O bound on performance issues, and we're trying to decrease that bottleneck. Obviously, faster switches like Infiniband would be awesome, but we can't afford the luxury of just throwing out what we have and getting new equipment.
My question posed is this: all traditional and serious HPC applications done on clusters are typically implemented with message passing rather than by sending over sockets directly. What are the performance benefits of this? Should we see a speedup if we switched from sockets?
MPI MIGHT use sockets. But there are also MPI implementations to be used with a SAN (system area network) that use direct distributed shared memory. That is, of course, only if you have the hardware for it. So MPI allows you to use such resources in the future. In that case you can gain massive performance improvements (in my experience with clusters back in my university days, you can reach gains of a few orders of magnitude). So if you are writing code that can be ported to higher-end clusters, using MPI is a very good idea.
Even discarding performance issues, using MPI can save you a lot of time, that you can use to improve performance of other parts of your system or simply save your sanity.
I would recommend using MPI instead of rolling your own, unless you are very good at that sort of thing. Having written some distributed-computing-esque applications using my own protocols, I always found myself reproducing (and poorly reproducing) features found within MPI.
Performance-wise I would not expect MPI to give you any tangible network speedups - it uses sockets just like you. MPI will however provide you with much of the functionality you would need for managing many nodes, i.e. synchronisation between nodes.
Performance is not the only consideration in this case, even on high performance clusters. MPI offers a standard API, and is "portable." It is relatively trivial to switch an application between the different versions of MPI.
Most MPI implementations use sockets for TCP based communication. Odds are good that any given MPI implementation will be better optimized and provide faster message passing, than a home grown application using sockets directly.
In addition, should you ever get a chance to run your code on a cluster that has InfiniBand, the MPI layer will abstract any of those code changes. This is not a trivial advantage - coding an application to directly use OFED (or another IB Verbs) implementation is very difficult.
Most MPI implementations also include small test apps that can be used to verify the correctness of the networking setup independently of your application. This is a major advantage when it comes time to debug your application. The MPI standard includes the "PMPI" interface for profiling MPI calls. This interface also allows you to easily add checksums or other data verification to all the message-passing routines.
Message Passing is a paradigm not a technology. In the most general installation, MPI will use sockets to communicate. You could see a speed up by switching to MPI, but only in so far as you haven't optimized your socket communication.
How is your application I/O bound? Is it bound on transferring the data blocks to the work nodes, or is it bound because of communication during computation?
If the answer is "because of communication" then the problem is you are writing a tightly-coupled application and trying to run it on a cluster designed for loosely coupled tasks. The only way to gain performance will be to get better hardware (faster switches, infiniband, etc. )... maybe you could borrow time on someone else's HPC?
If the answer is "data block" transfers then consider assigning workers multiple data blocks (so they stay busy longer) & compress the data blocks before transfer. This is a strategy that can help in a loosely coupled application.
MPI has the benefit that you can do collective communications. Doing broadcasts/reductions in O(log p) /* p is your number of processors*/ instead of O(p) is a big advantage.
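A hedged sketch of such a collective (broadcast plus reduction); the values are illustrative:

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = (rank == 0) ? 42 : 0;

    // Broadcast from rank 0 to all p ranks, typically O(log p) internally.
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Sum the values from all ranks back onto rank 0, also O(log p).
    int sum = 0;
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}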
I'll have to agree with OldMan and freespace. Unless you know of a specific improvement to some useful metric (performance, maintainability, etc.) over MPI, why reinvent the wheel? MPI represents a large amount of shared knowledge regarding the problem you are trying to solve.
There are a huge number of issues you need to address beyond just sending data. Connection setup and maintenance will all become your responsibility. If MPI is the exact abstraction you need (and it sounds like it is), use it.
At the very least, using MPI and later refactoring it out with your own system is a good approach, costing only the installation and dependency of MPI.
I especially like OldMan's point that MPI gives you much more beyond simple socket communication. You get a slew of parallel and distributed computing implementations with a transparent abstraction.
I have not used MPI, but I have used sockets quite a bit. There are a few things to consider on high performance sockets. Are you doing many small packets, or large packets? If you are doing many small packets consider turning off the Nagle algorithm for faster response:
setsockopt(m_socket, IPPROTO_TCP, TCP_NODELAY, ...);
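For completeness, a hedged sketch of the full call on POSIX systems (m_socket is assumed to be a connected TCP socket descriptor):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void disable_nagle(int m_socket) {
    int flag = 1;  // non-zero disables Nagle's algorithm on this socket
    setsockopt(m_socket, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}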
Also, using signals can actually be much slower when trying to get a high volume of data through. Long ago I made a test program where the reader would wait for a signal and then read a packet - it would get about 100 packets/sec. Then I just did blocking reads, and got 10000 reads/sec.
The point is: look at all these options, and actually test them out. Different conditions will make different techniques faster or slower. It's important not to just get opinions, but to put them to the test. Steve Maguire talks about this in "Writing Solid Code". He uses many examples that are counter-intuitive, and tests them to find out what makes better/faster code.
MPI uses sockets underneath, so really the only difference should be the API that your code interfaces with. You could fine-tune the protocol if you are using sockets directly, but that's about it. What exactly are you doing with the data?
MPI uses sockets, and if you know what you are doing you can probably get more bandwidth out of sockets, because you need not send as much metadata.
But you have to know what you are doing, and it's likely to be more error-prone. Essentially you'd be replacing MPI with your own messaging protocol.
For high-volume, low-overhead business messaging you might want to check out
AMQP, which has several product implementations. The open-source variant OpenAMQ supposedly runs the trading at JP Morgan, so it should be reliable, shouldn't it?