Low-latency read of a UDP port - C++

I am reading a single data item from a UDP port. It's essential that this read be the lowest latency possible. At present I'm reading via the boost::asio library's async_receive_from method. Does anyone know the kind of latency I will experience between the packet arriving at the network card, and the callback method being invoked in my user code?
Boost is a very good library, but quite generic; is there a lower-latency alternative?
All opinions on writing low-latency UDP network programs are very welcome.
EDIT: Another question: is there a practical way to estimate the latency that I'm experiencing between the NIC and user mode?

Your latency will vary, but it will be far from the best you can get. Here are a few things that will stand in the way of better latency:
Boost.ASIO
It constantly allocates/deallocates memory to store "state" in order to invoke a callback function associated with your read operation.
It does unnecessary mutex locking/unlocking in order to support a broken mix of async and sync approaches.
Worst of all, it constantly adds and removes event descriptors from the underlying notification mechanism.
All in all, asio is a good library for high-level application developers, but it comes with a big price tag and a lot of CPU-cycle-eating gremlins. Another alternative is libevent; it is a lot better, but it still aims to support many notification mechanisms and be platform-independent. Nothing can beat the native mechanisms, i.e. epoll on Linux.
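For comparison, here is a minimal sketch of reading a single datagram directly through epoll with nothing in between (assuming Linux; the port number is hypothetical and error handling is stripped down to keep the structure visible). The descriptor is registered once and never removed, which avoids the add/remove churn mentioned above:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        int fd = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);             // hypothetical port
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

        int ep = epoll_create1(0);
        epoll_event ev{};
        ev.events = EPOLLIN;                     // read readiness only
        ev.data.fd = fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);   // register once, never removed

        char buf[2048];
        for (;;) {
            epoll_event out;
            int n = epoll_wait(ep, &out, 1, -1); // block until a datagram arrives
            if (n == 1 && (out.events & EPOLLIN)) {
                ssize_t len = recv(fd, buf, sizeof(buf), 0);
                if (len > 0) {
                    // handle the datagram here
                }
            }
        }
        close(ep);
        close(fd);
    }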
Other things
UDP stack. It doesn't do a very good job for latency-sensitive applications. One of the most popular solutions is OpenOnload. It bypasses the kernel stack and works directly with your NIC.
Scheduler. By default, the scheduler is optimized for throughput, not latency. You will have to tweak and tune your OS to make it latency-oriented. Linux, for example, has a lot of "rt" patches for that purpose.
Watch out not to sleep. Once your process is sleeping, you will never get a good wakeup latency compared to constantly burning CPU and waiting for a packet to arrive.
Interference from other IRQs, other processes, etc.
I cannot tell you exact numbers, but assuming that you won't be getting a lot of traffic, using Boost and a regular Linux kernel on regular hardware, your latency will range somewhere between ~50 microseconds and ~100 milliseconds. It will improve a bit as you get more data, and after some point it will start dropping, but it will always fluctuate. I'd say that if you are OK with those numbers, don't bother optimizing.
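Regarding the EDIT: one practical way to estimate the kernel-to-user part of the latency is to enable software receive timestamps on the socket and compare them with the time at which your code actually sees the datagram. A minimal sketch, assuming Linux; the helper name print_rx_delay is made up, and this measures the stack plus wakeup delay rather than the wire (true NIC hardware timestamps need SO_TIMESTAMPING plus NIC support):

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <cstdio>
    #include <cstring>
    #include <ctime>

    void print_rx_delay(int fd) {
        char data[2048];
        alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(timespec))];
        iovec iov{data, sizeof(data)};
        msghdr msg{};
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        if (recvmsg(fd, &msg, 0) <= 0) return;

        timespec user_ts{};
        clock_gettime(CLOCK_REALTIME, &user_ts);       // "now", in user space

        for (cmsghdr* c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPNS) {
                timespec kern_ts{};
                std::memcpy(&kern_ts, CMSG_DATA(c), sizeof(kern_ts));
                long delta_ns = (user_ts.tv_sec - kern_ts.tv_sec) * 1000000000L
                              + (user_ts.tv_nsec - kern_ts.tv_nsec);
                std::printf("kernel->user delay: %ld ns\n", delta_ns);
            }
        }
    }

    // Enable the timestamps once, right after creating the socket:
    //   int on = 1;
    //   setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof(on));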

I think that using recv() in a "spin"-loop thread, with the thread pinned to a single CPU core (processor affinity), gives lower latency than using select(): in my tests the wakeup precision of select() varies from 1 to 10 microseconds, while the spin loop stays at about 1 microsecond.
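A minimal sketch of that spin-loop idea, assuming Linux/glibc, a socket that is already bound, and an arbitrary core number; whether burning a whole core is acceptable is a deployment decision:

    #include <pthread.h>
    #include <sched.h>
    #include <sys/socket.h>
    #include <cerrno>

    void pin_current_thread(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    void spin_receive(int fd) {
        pin_current_thread(2);                  // dedicate one core to the loop
        char buf[2048];
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
            if (n > 0) {
                // handle the datagram immediately
            } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
                break;                          // real error, stop spinning
            }
            // no sleep: burn the core so wakeup latency never appears
        }
    }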

Related

What actually happens in asynchronous IO

I keep reading about why asynchronous IO is better than synchronous IO; the claim is that with async IO your program can keep running, while with sync IO you're blocked until the operation is finished.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
So async IO needs that CPU time as well, which might result in a context switch from my application to the kernel. So it's not really blocking, but CPU cycles still have to be spent to run the operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO where you wait for the data to be written to disk, in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (as I understand it, it supports zero copy as well, which is a benefit)
spdk (should be best, though I don't understand how to use it)
aio
Your understanding is partly right, but which tools you use is a matter of which programming model you prefer, and doesn't determine whether your program will freeze waiting for I/O operations to finish. For certain specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to the systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet. (A minimal sketch of this approach follows below.)
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
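For concreteness, here is a minimal sketch of the first approach, a poll()-based event loop, assuming POSIX and sockets that are already connected; the per-descriptor parsing state mentioned above is only hinted at in a comment:

    #include <poll.h>
    #include <unistd.h>
    #include <vector>

    void event_loop(std::vector<int> fds) {
        std::vector<pollfd> pfds;
        for (int fd : fds) pfds.push_back(pollfd{fd, POLLIN, 0});

        char buf[4096];
        for (;;) {
            if (poll(pfds.data(), pfds.size(), -1) <= 0) continue;  // wait for readiness
            for (auto& p : pfds) {
                if (p.revents & POLLIN) {
                    ssize_t n = read(p.fd, buf, sizeof(buf));
                    if (n > 0) {
                        // feed buf[0..n) into this descriptor's parser / state machine
                    } else {
                        close(p.fd);
                        p.fd = -1;              // negative fd: poll() ignores this entry
                    }
                }
            }
        }
    }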
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event-handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O regardless of which programming model you pick (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
The kernel does need CPU time in order to do it.
Is that correct?
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For an example; the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000" and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when the disk controller has finished transferring the data).
However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).
Of course, hardware is often more advanced (e.g. it has an internal queue of operations of its own, so the driver can tell it to do multiple things and it can start the next operation as soon as it finishes the previous one); and drivers are often more advanced too (e.g. having "IO priorities" to ensure that more important work is done first, rather than just a simple FIFO queue of pending operations).
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most of the hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data was read; a second task that waits until data arrives and processes it and notifies another task when the data was processed; then a third task that waits until data was processed and writes it to deviceB synchronously. For utilization; this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
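A rough sketch of that "multiple tasks" alternative, using C++17 threads and a hand-rolled queue (the Channel class and the record strings are mine, and the device access is stubbed out; the point is only how the three stages overlap):

    #include <condition_variable>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <string>
    #include <thread>

    template <typename T>
    class Channel {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool closed_ = false;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
            cv_.notify_all();
        }
        std::optional<T> pop() {                     // blocks until data or close
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return std::nullopt;
            T v = std::move(q_.front()); q_.pop();
            return v;
        }
    };

    int main() {
        Channel<std::string> raw, processed;

        std::thread reader([&] {                     // "deviceA": blocking reads
            for (int i = 0; i < 100; ++i) raw.push("record " + std::to_string(i));
            raw.close();
        });
        std::thread processor([&] {                  // CPU stage
            while (auto r = raw.pop()) processed.push(*r + " (processed)");
            processed.close();
        });
        std::thread writer([&] {                     // "deviceB": blocking writes
            while (auto r = processed.pop()) { /* write *r to the output device */ }
        });

        reader.join(); processor.join(); writer.join();
    }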
Context switching is necessary in any case; the kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require a lot of processor work. The limiting factor is the disk response. The question is whether we wait for this response or do our own work in the meantime.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If you implement synchronous access, your sequence is the following:
get information
write information
goto 1.
So you can't get information until write() completes. Suppose the information supplier is as slow as the disk you write to. In that case the synchronous program will be twice as slow as the asynchronous one.
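For example (illustrative numbers only): if each read takes 10 ms and each write takes 10 ms, the synchronous loop spends 20 ms per item, while overlapping the next read with the current write brings it down to roughly 10 ms per item.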
If the information supplier can't wait and buffer the information while you are writing, you will lose portions of it during the write. Examples of such information sources are sensors for fast processes. In this case, you should read the sensors synchronously and save the obtained values asynchronously.
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal benefit the most. Using asynchronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(and "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or a network client to send a command)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.

Performance of multithreaded TCP networking

I'm working on a project using the TCP protocol that may have to handle many hundreds of connections or more at once.
As such, I am uncertain about which method I should use to collect and send this data.
I was wondering whether the principle of more threads = more performance applies here.
My reason for doubt is that all data still has to be fed through the network connection, of which most devices have only one active at a time. In addition, I know that repeated context switching can reduce performance as well.
However, I've seen from other sources suggesting that multithreading does indeed scale network performance to a point, and if that's true, why?
Currently, I'm using the Non-Boost variant of ASIO to handle networking.
Thanks in advance for any assistance.
ASIO is a wrapper around epoll/IOCP, and as such is optimized for high-performance non-blocking I/O. It's possible to handle hundreds of thousands of simultaneous connections with this setup on a single thread. Indeed, the old-fashioned "a thread per client" setup could never reach this level of performance due to the context-switching overhead.
With that said, depending on the protocol used, handling network requests and replies takes some CPU time, and on a high-rate network it might saturate the single CPU core on which the io_service is running. In that case it is possible to parallelize the io_service so that completion routines can run on more than one core. Still no context switching would take place if the number of threads doesn't exceed the number of available CPU cores/hardware threads. Context switching occurs when the same core needs to handle multiple threads and also when switching between user and kernel mode (i.e. twice for each system call).
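A minimal sketch of that parallelization, shown with Boost.Asio's io_context naming, introduced around 1.66 (standalone Asio is the same without the boost:: prefix); the thread count and work guard are the usual pattern, not anything specific to your protocol:

    #include <boost/asio.hpp>
    #include <algorithm>
    #include <thread>
    #include <vector>

    int main() {
        boost::asio::io_context io;

        // Keep the io_context alive even while no handlers are pending.
        auto guard = boost::asio::make_work_guard(io);

        // ... set up acceptors/sockets and start async operations on io ...

        std::vector<std::thread> pool;
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < n; ++i)
            pool.emplace_back([&io] { io.run(); });  // each thread processes handlers

        // run() blocks for the lifetime of the server; call guard.reset()
        // and io.stop() from a signal handler to shut down cleanly.
        for (auto& t : pool) t.join();
    }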
Benchmark your server to see how many clients it can handle on a single thread. Chances are it will be enough. Parallelizing io_service comes at a cost of having to deal with completion routines running in parallel, which almost always requires additional synchronization, which means additional overhead.
You want about the same number of threads as you have CPU cores, including hyperthreaded ones. Not more.
Each thread deals with a subset of the sockets. That way, you maximize CPU parallelism, but minimize overhead.
If you truly need hundreds of connections and require low latency, you should consider UDP, where a single socket can receive from many remote addresses. But you then have to implement reliability yourself. Still, that's how multi-player AAA game servers are typically run, and there are good reasons for it.
Multi-Threading vs Single-Threading is a hard topic, and I think it all depends on the point of view of your implementation.
If you have a good event-driven system on one thread, using a single thread for the low-level network IO will probably be better.
Spawning threads has a performance penalty of its own, since the system has to manage them. Of course it helps to use the extra processors, but, as you said, once you get down to the low level all threads will need some kind of synchronization (another penalty), unless you use one socket per thread.
One major drawback of multi-threading with one socket per thread on networks is that, most of the time, your system will be vulnerable to 'slowloris' attacks.
Wikipedia for slow loris
Computerphile video on slow loris
So I think you are better off using multiple threads for other long-waiting or time-consuming tasks. Of course, you should use non-blocking IO.

Boost Asio single threaded performance

I am implementing a custom server that needs to maintain a very large number (100K or more) of long-lived connections. The server simply passes messages between sockets and doesn't do any serious data processing. Messages are small, but many of them are received/sent every second. Reducing latency is one of the goals. I realize that using multiple cores won't improve performance and therefore I decided to run the server in a single thread by calling the run_one or poll methods of the io_service object. In any case, a multi-threaded server would be much harder to implement.
What are the possible bottlenecks? Syscalls, bandwidth, completion queue / event demultiplexing? I suspect that dispatching handlers may require locking (that is done internally by the asio library). Is it possible to disable event queue locking (or any other locking) in boost.asio?
EDIT: related question. Does syscall performance improve with multiple threads? My feeling is that because syscalls are atomic/synchronized by the kernel adding more threads won't improve speed.
You might want to read my question from a few years ago, I asked it when first investigating the scalability of Boost.Asio while developing the system software for the Blue Gene/Q supercomputer.
Scaling to 100k or more connections should not be a problem, though you will need to be aware of the obvious resource limitations such as the maximum number of open file descriptors. If you haven't read the seminal C10K paper, I suggest reading it.
After you have implemented your application using a single thread and a single io_service, I suggest investigating a pool of threads invoking io_service::run(), and only then investigate pinning an io_service to a specific thread and/or cpu. There are multiple examples included in the Asio documentation for all three of these designs, and several questions on SO with more information. Be aware that as you introduce multiple threads invoking io_service::run() you may need to implement strands to ensure the handlers have exclusive access to shared data structures.
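As an illustration of the strand point, here is a minimal sketch (Boost.Asio 1.70+ naming) in which handlers run on several threads but a strand serializes access to a shared counter, so the handlers themselves need no mutex:

    #include <boost/asio.hpp>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        boost::asio::io_context io;
        auto strand = boost::asio::make_strand(io);
        int shared_counter = 0;                    // protected by the strand, not a mutex

        for (int i = 0; i < 1000; ++i) {
            boost::asio::post(strand, [&shared_counter] {
                ++shared_counter;                  // serialized: no two run concurrently
            });
        }

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < 4; ++i)
            pool.emplace_back([&io] { io.run(); });
        for (auto& t : pool) t.join();

        std::cout << shared_counter << "\n";       // always 1000
    }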
Using boost::asio you can write a single-threaded or multi-threaded server at approximately the same development cost. You can write the single-threaded version first, then convert it to multi-threaded if needed.
Typically, the only bottleneck in boost::asio is that the epoll/kqueue reactor runs under a mutex, so only one thread is doing epoll at any given time. This can decrease performance when you have a multi-threaded server that serves lots and lots of very small packets. But, IMO, it should still be faster than a plain single-threaded server.
Now about your task. If you just want to pass messages between connections, I think it has to be a multi-threaded server. The problem is the syscalls (recv/send etc.). An instruction is a very cheap thing for the CPU, but a syscall is not a very "light" operation (everything is relative, but relative to the other jobs in your task). So with a single thread you will get a big syscall overhead, which is why I recommend a multi-threaded scheme.
Also, you can partition io_service objects and use the "io_service per thread" idiom. I think this should give the best performance, but it has a drawback: if one io_service ends up with too big a queue, the other threads will not help it, so some connections may slow down. On the other side, with a single io_service, contention on the shared queue can lead to big locking overhead. All you can do is implement both variants and measure bandwidth/latency; it should not be too difficult to implement both.
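A rough sketch of the "io_service per thread" idiom described above, using the newer io_context naming; the IoPool class and its round-robin pick() are my own names, not an Asio facility:

    #include <boost/asio.hpp>
    #include <atomic>
    #include <memory>
    #include <thread>
    #include <vector>

    class IoPool {
        std::vector<std::unique_ptr<boost::asio::io_context>> ios_;
        std::vector<boost::asio::executor_work_guard<
            boost::asio::io_context::executor_type>> guards_;
        std::vector<std::thread> threads_;
        std::atomic<std::size_t> next_{0};
    public:
        explicit IoPool(std::size_t n) {
            for (std::size_t i = 0; i < n; ++i) {
                ios_.push_back(std::make_unique<boost::asio::io_context>());
                guards_.push_back(boost::asio::make_work_guard(*ios_.back()));
                threads_.emplace_back([io = ios_.back().get()] { io->run(); });
            }
        }
        boost::asio::io_context& pick() {           // round-robin assignment
            return *ios_[next_++ % ios_.size()];
        }
        ~IoPool() {
            for (auto& io : ios_) io->stop();
            for (auto& t : threads_) t.join();
        }
    };

    // Usage: create each new connection's socket against pool.pick(), so the
    // connection stays on one io_context (and thus one thread) for its lifetime.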

Why the need for Async IO when reading sockets for non HTTP server

I'm designing a C++ client application that listens on multiple ports for streams of short messages. After reading up on ACE, POCO, boost::asio and all the Proactor-like design patterns, I am about to start with boost::asio.
One thing I notice is a constant theme of using asynchronous socket IO, yet I have not read a good description of the problem that async IO solves. Are all these design patterns based on the assumptions of HTTP web server design?
Since web servers are the most common application for complex, latency-sensitive, concurrent socket programming, I'm starting to wonder if most of these patterns/idioms are geared toward this one application.
My application will listen to a handful of sockets for short and frequent messages. A separate thread will need to combine all messages for processing. One thing I am looking at design patterns for is separating the connection management from the data processing. I want the connections to try to reconnect after a disconnect and have the processing thread continue as if nothing had happened. What design pattern is recommended here?
I don't see how async IO will improve performance in my case.
You're on the right track. It's smart to ask "why" especially with all the hype around asynchronous and event driven. There are applications other than the web. Consider message queues and financial transactions like high frequency trading. Basically any time that waiting costs money or loses an opportunity to serve a customer is a candidate for async. The web is a big example because the network is so much faster than the database. As always, ask "does this make sense" for your app. Async adds a lot of complexity if you're not benefiting from it.
Your short, rapid messages may actually benefit a lot from async if the mean time between messages is comparable to the time required to process each message, especially if that processing includes persistence. But you don't have to rush into async. Instrument your blocking code and see whether you really have a bottleneck.
I hope this helps.
Using the blocking calls pattern would entail:
1. Listening on a socket vector of size N
2. When a message arrives, you wake up after some time K, find it and start processing the message, which takes a time T (it does not matter if the processing is offloaded to another thread; in that case T becomes your offloading time)
3. You finish examining the vector and GOTO 1
So you might say that if M messages have arrived, and another message arrives during the K+M*T dispatching time, the (M+1)-th message will find itself waiting for K+M*T. Is this acceptable for your expected values of K (constant), M (a function of traffic) and T (a function of resources and system load)?
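To put hypothetical numbers on that: with K = 10 microseconds, T = 50 microseconds and M = 20 messages already queued, the 21st message waits 10 + 20*50 = 1010 microseconds, about a millisecond, before it is even looked at; whether that is acceptable depends entirely on your traffic and latency budget.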
Asynchronous processing, well, actually does not exist. There will always be a "synchronous" IO loop somewhere, only it will be so well integrated in the kernel (or even hardware) that it will run 10-100x faster than your own, and therefore will be likely to scale better with large values of M. The delay is still in the form K1+M*T1, but this time K1 and T1 are much lower. Or maybe K1 is a bit higher and T1 is significantly lower: the architecture "scales better" for large values of M.
If your values of M are usually low, then the advantages of asynchronicity are proportionally smaller. In the absurd case when you only have one message in the application lifetime, synchronous or asynchronous makes next to no difference.
Take into account another factor: if the number of messages becomes really large, asynchronicity has its advantages; but if the messages themselves are independent (the changes caused by message A do not influence processing of message B), then you can remain synchronous and scale horizontally, preparing a number Z of "message concentrators" each receiving a fraction M/Z of the total traffic.
If the processing requires performing other calls to other services (cache, persistence, information retrieval, authentication...), increasing the T factor, then you'll be better off turning to multithreaded or even asynchronous. With multithreading you shave T to a fraction of its value (the dispatching time only). Asynchronous in a sense does the same, but shaving even more, and taking care of more programming boilerplate for you.

Threads ordering in C++/Linux

I'm currently doing a simulation of a hard disk drive IOs in C++, and I'm using pthread threads and a mutex to do the reading on the disk.
However, I'm trying to optimize the reading time by ordering my threads. The problem is that if my disk is currently reading a sector and a bunch of read requests arrive, any of them may be executed next. What I want is to order them so that the request for the closest sector is executed next.
This way, the head of the virtual hard disk drive won't move excessively.
My question is: is using the Linux process priority system a good way to make sure that the closest read request will be executed before the others? If not, what could I rely on to do this?
PS: Sorry for my English.
Thanks for your help.
It is very rarely a good idea to rely on the exact behaviour of process priority schemes, especially on a general-purpose operating system like Linux, because they don't really guarantee you any particular behaviour. Making something the very highest priority won't help if it touches some address in memory or makes some I/O call that causes it to be held up for an instant - the operating system will then run some lower-priority process instead, and you will be unpleasantly surprised.
If you want to be sure of the order in which disk I/O requests are completed, or to simulate this, you could create a thread that keeps a list of pending I/O and asks for the requests to be executed one at a time, in an order it controls.
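A hedged sketch of that idea using C++11 threads (the original question uses pthreads, but the translation is mechanical): a scheduler thread owns the pending-request list and always dispatches the request whose sector is closest to the current head position, i.e. a simplified shortest-seek-time-first policy. The class and member names are made up for illustration; you would start it with std::thread t(&DiskScheduler::run, &scheduler); and have worker threads call submit() instead of reading directly.

    #include <algorithm>
    #include <condition_variable>
    #include <cstdlib>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Request { long sector; /* plus buffer, completion callback, ... */ };

    class DiskScheduler {
        std::vector<Request> pending_;
        std::mutex m_;
        std::condition_variable cv_;
        long head_ = 0;
        bool stop_ = false;
    public:
        void submit(Request r) {
            { std::lock_guard<std::mutex> lk(m_); pending_.push_back(r); }
            cv_.notify_one();
        }
        void shutdown() {
            { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
            cv_.notify_one();
        }
        void run() {                                // the scheduler thread body
            for (;;) {
                Request r;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [&] { return !pending_.empty() || stop_; });
                    if (pending_.empty()) return;
                    auto it = std::min_element(pending_.begin(), pending_.end(),
                        [&](const Request& a, const Request& b) {
                            return std::labs(a.sector - head_) < std::labs(b.sector - head_);
                        });
                    r = *it;
                    pending_.erase(it);
                    head_ = r.sector;               // head moves to the serviced sector
                }
                // perform (or simulate) the read for r outside the lock
            }
        }
    };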
The I/O schedulers in the Linux kernel can re-order and coalesce reads (and to some extent writes) so that their ordering is more favorable for the disk, just like you are describing. This affects the process scheduler (which takes care of threads too) in that the threads waiting for I/O also get "re-ordered" - their read or write requests complete in the order in which the disk served them, not in the order in which they made their request. (This is a very simplified view of what really happens.)
But if you're simulating disk I/O, i.e. if you're not actually doing real I/O, the I/O scheduler isn't involved at all. Only the process scheduler. And the process scheduler has no idea that you're "simulating" a hard disk - it has no information about what the processes are doing, just information about whether or not they're in need of CPU resources. (Again this is a simplified view of how things work).
So the process scheduler will not help you in re-ordering or coalescing your simulation of read requests. You need to implement that logic in your code. (Reading about I/O schedulers is a great idea.)
If you do submit real I/O, then doing the re-ordering yourself could improve performance in some situations, and indeed the I/O scheduler's algorithms for optimizing throughput or latency will affect the way your threads are scheduled (for blocking I/O anyway - asynchronous I/O makes it a bit more complicated still).