Performance of multithreaded TCP networking - c++

I'm working on a project using the TCP protocol that may have to work with many 100s or more connections at once.
As such, I am uncertain what method I should use to collect and send this data.
I was wondering whether the principle of more threads = more performance applies here.
My reason for doubt is because all data still has to be fed through the network connection, of which most devices only have 1 active at a time. In addition, I know that repeated context switching can reduce performance as well.
However, I've seen other sources suggest that multithreading does indeed scale network performance up to a point, and if that's true, why?
Currently, I'm using the Non-Boost variant of ASIO to handle networking.
Thanks in advance for any assistance.

ASIO is a wrapper around epoll/IOCP, and as such is optimized for high-performance non-blocking I/O. It's possible to achieve hundreds of thousands of simultaneous connections with this setup on a single thread. Indeed, the old-fashioned "a thread per client" setup could never reach this level of performance due to the context-switching overhead.
With that said, depending on the protocol used, handling network requests and replies takes some CPU time, and on a high-rate network it might saturate the single CPU core on which the io_service is running. In that case it is possible to parallelize the io_service so that completion routines can run on more than one core. Still no context switching would take place if the number of threads doesn't exceed the number of available CPU cores/hardware threads. Context switching occurs when the same core needs to handle multiple threads and also when switching between user and kernel mode (i.e. twice for each system call).
Benchmark your server to see how many clients it can handle on a single thread. Chances are it will be enough. Parallelizing io_service comes at a cost of having to deal with completion routines running in parallel, which almost always requires additional synchronization, which means additional overhead.
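For illustration, here is a minimal sketch of parallelizing the io_service (called io_context in recent versions of standalone Asio, which the asker uses). It is not the asker's code; the thread count and the work-guard setup are just one common way to do it.

```cpp
// Sketch: run one io_context's handlers on several threads (standalone Asio).
// Completion handlers may then run concurrently, so shared state needs
// strands or other synchronization.
#include <asio.hpp>
#include <algorithm>
#include <thread>
#include <vector>

int main() {
    asio::io_context io;

    // Keep run() from returning before any work has been posted.
    auto work = asio::make_work_guard(io);

    // One thread per hardware thread; more rarely helps for pure I/O.
    std::vector<std::thread> pool;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([&io] { io.run(); });

    // ... set up acceptors/sockets and post work here ...

    work.reset();               // allow run() to finish once work is drained
    for (auto& t : pool) t.join();
}
```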

You want about the same number of threads as you have CPU cores, including hyperthreaded ones. Not more.
Each thread deals with a subset of the sockets. That way, you maximize CPU parallelism, but minimize overhead.
If you truly need hundreds of connections and require low latency, you should consider UDP, where a single socket can receive from many remote addresses. But you then have to implement reliability yourself. Still, that's how multi-player AAA game servers are typically run, and there are good reasons for it.
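For what it's worth, here is a hedged sketch of the single-UDP-socket idea using plain POSIX sockets; the port number is made up, and reliability is left entirely to the application.

```cpp
// Sketch: one UDP socket can receive datagrams from many peers;
// recvfrom() reports each sender's address.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    sockaddr_in local{};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(9000);                 // example port
    bind(fd, reinterpret_cast<sockaddr*>(&local), sizeof(local));

    char buf[2048];
    for (;;) {
        sockaddr_in peer{};
        socklen_t len = sizeof(peer);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             reinterpret_cast<sockaddr*>(&peer), &len);
        if (n < 0) break;
        char ip[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, &peer.sin_addr, ip, sizeof(ip));
        std::printf("%zd bytes from %s:%u\n", n, ip,
                    static_cast<unsigned>(ntohs(peer.sin_port)));
    }
    close(fd);
}
```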

Multi-Threading vs Single-Threading is a hard topic, and I think it all depends on the point of view of your implementation.
If you have a good event-driven system, using a single thread for the low-level network IO will probably be better.
Spawning threads has a performance penalty of its own, since the system needs to manage them. Of course it helps to use the extra processors, but as you said, once you get down to the low level all threads will need some kind of synchronization, which is another penalty, unless you are using one socket per thread.
One major drawback of multi-threading (one socket per thread) on networks is that your system becomes subject to 'slowloris' attacks.
Wikipedia for slow loris
Computerphile video on slow loris
So, I think you are better off using multiple threads for other long-waiting or time-consuming tasks. Of course you should use non-blocking IO.

Related

Benefits of a multi thread program in a unicore system [duplicate]

My professor casually mentioned that we should write multi-threaded programs even if we are using a unicore processor; however, because of a lack of time, he did not elaborate on it.
I would like to know what the benefits of a multi-threaded program are on a unicore processor.
It won't be as significant as on a multi-core system, but it can still provide some benefits.
Mainly, the benefits you get relate to the context switch that happens when the currently executing thread has to wait for something. The executing thread may be waiting on anything, such as a hardware resource, a branch mis-prediction penalty, or a data transfer after a cache miss.
At this point another, ready thread can be executed to make use of this "waiting time". But of course the context switch itself takes some time, and managing threads in your code rather than computing sequentially adds some extra complexity to your program. And, as has been said, some applications need to be multi-threaded, so there is no escaping the context switch in some cases.
Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example - The GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.
Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.
The benefits can come in different forms.
One widely used example is a GUI application that has to perform some kind of computation. With a single thread, the user has to wait for the result before doing anything else with the application, but if you run the computation in a separate thread, the user interface stays available to the user during the computation. So a multi-threaded program can emulate a multi-tasking environment even on a unicore system. That's one of the points.
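As a minimal illustration (not tied to any particular GUI toolkit), the sketch below runs a stand-in computation via std::async and polls its future, so the main loop stays free to do other work; the workload is invented for the example.

```cpp
// Sketch: long computation off the main thread; the main thread keeps
// "interacting" while it runs.
#include <chrono>
#include <future>
#include <iostream>

long long heavy_computation() {
    long long sum = 0;
    for (long long i = 0; i < 300000000; ++i) sum += i;   // stand-in workload
    return sum;
}

int main() {
    auto result = std::async(std::launch::async, heavy_computation);

    // Poll the future; a real GUI would process events here instead.
    while (result.wait_for(std::chrono::milliseconds(100)) !=
           std::future_status::ready) {
        std::cout << "still responsive...\n";
    }
    std::cout << "result: " << result.get() << '\n';
}
```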
As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. "Future" and boost::asio provide ways of doing non-blocking stuff without necessarily resorting to multiple threads.
I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking or long-running call you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code: for example, background polling for updates, monitoring some resource (e.g. free storage), or checking internet connectivity. If you keep these in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program; the sole communication with the main logic is usually a single 'event' (see the sketch after this list).
In some environments you might get more processing time. This mainly depends on how your OS scheduling system works, but if it allocates time per thread, the more threads you have, the more often your app will be scheduled.
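Here is the sketch referred to above: a hedged example of wrapping an almost standalone task, a connectivity/resource monitor, in its own thread. check_connectivity() is a hypothetical stub, and the intervals are arbitrary.

```cpp
// Sketch: a monitor thread periodically checks a resource and signals the
// main logic through a single atomic flag.
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> online{true};
std::atomic<bool> stop{false};

// Hypothetical stub: in reality, ping a server, stat a disk, ...
bool check_connectivity() { return true; }

void monitor() {
    while (!stop.load()) {
        online.store(check_connectivity());
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
}

int main() {
    std::thread t(monitor);
    // Main logic reads 'online' whenever it cares; here we just wait a bit.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stop.store(true);
    t.join();
}
```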
Some benefits long-term:
Where it's not hard to do, you benefit when your hardware evolves. You never know what's going to happen: today your app runs on a single-core embedded device, tomorrow that device gets a quad core. Programming with threads from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash, so that all related operations end up in the same thread. The advantage on a single core is 'small', but it's not hard to do since you need few synchronization primitives, so the overhead stays small.
That said, I think there are situations where it's very ill-advised:
As soon as the synchronization you need with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). Multi-threading might still give you a benefit when you eventually move to multiple CPUs, but the overhead is huge, both for your single core and for your programming time.
For instance, think about operations that block because of slow peripheral devices (hard disk access etc.). While these are waiting, even a single core can do other things asynchronously.
In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for completion of IO requests (user input, network/disk IO), for critical resources to become available, or for any sort of asynchronously triggered event, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchronous IO, coroutines, or fibers come to mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.
As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The average number of cores will double every 18 months in the future. People have learned single-threaded programming for 50 years now, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming. This refers to low-level aspects like avoiding race conditions and proper synchronization, as well as the high-level aspects like the general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.

Boost Asio single threaded performance

I am implementing a custom server that needs to maintain a very large number (100K or more) of long-lived connections. The server simply passes messages between sockets and doesn't do any serious data processing. Messages are small, but many of them are received/sent every second. Reducing latency is one of the goals. I realize that using multiple cores won't improve performance and therefore decided to run the server in a single thread by calling the run_one or poll methods of the io_service object. In any case, a multi-threaded server would be much harder to implement.
What are the possible bottlenecks? Syscalls, bandwidth, completion queue / event demultiplexing? I suspect that dispatching handlers may require locking (done internally by the asio library). Is it possible to disable event queue locking (or any other locking) in boost.asio?
EDIT: a related question. Does syscall performance improve with multiple threads? My feeling is that because syscalls are atomic/synchronized by the kernel, adding more threads won't improve speed.
You might want to read my question from a few years ago; I asked it when first investigating the scalability of Boost.Asio while developing the system software for the Blue Gene/Q supercomputer.
Scaling to 100k or more connections should not be a problem, though you will need to be aware of the obvious resource limitations such as the maximum number of open file descriptors. If you haven't read the seminal C10K paper, I suggest reading it.
After you have implemented your application using a single thread and a single io_service, I suggest investigating a pool of threads invoking io_service::run(), and only then investigate pinning an io_service to a specific thread and/or cpu. There are multiple examples included in the Asio documentation for all three of these designs, and several questions on SO with more information. Be aware that as you introduce multiple threads invoking io_service::run() you may need to implement strands to ensure the handlers have exclusive access to shared data structures.
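As a rough sketch of that last point, the following (Boost.Asio; make_strand and post are available in recent Boost versions) posts handlers that share state through a strand while several threads call run(). It is illustrative only, not taken from the Asio examples, and the handler count and thread count are arbitrary.

```cpp
// Sketch: a strand serializes the handlers that touch one piece of shared
// state while multiple threads run the io_context (io_service is the older
// name for io_context).
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main() {
    boost::asio::io_context io;
    auto strand = boost::asio::make_strand(io);

    int shared_counter = 0;   // only ever touched from handlers on 'strand'

    for (int i = 0; i < 1000; ++i)
        boost::asio::post(strand, [&shared_counter] { ++shared_counter; });

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < 4; ++i)
        pool.emplace_back([&io] { io.run(); });
    for (auto& t : pool) t.join();     // shared_counter == 1000, no data race
}
```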
Using boost::asio you can write a single-threaded or multi-threaded server at approximately the same development cost. You can write the single-threaded version first, then convert it to multi-threaded if needed.
Typically, the only bottleneck for boost::asio is that the epoll/kqueue reactor runs under a mutex, so only one thread is doing epoll at a time. This can decrease performance when you have a multi-threaded server that serves lots and lots of very small packets. But, in my opinion, it should still be faster than a plain single-threaded server.
Now about your task. If you just want to pass messages between connections, I think it should be a multi-threaded server. The problem is the syscalls (recv/send etc.). An instruction is a very easy thing for the CPU to do, but a syscall is not a very "light" operation (everything is relative, but relative to the other jobs in your task). So, with a single thread you will get a big syscall overhead, which is why I recommend a multi-threaded scheme.
Also, you can partition the io_service and use the "io_service per thread" idiom. I think this should give the best performance, but it has a drawback: if one io_service accumulates too big a queue, the other threads cannot help it, so some connections may slow down. On the other side, with a single io_service, a heavily loaded queue can lead to big locking overhead. All you can do is implement both variants and measure bandwidth/latency; it should not be too difficult to implement both.
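A hedged sketch of the "io_service per thread" idiom (written against the newer io_context name) might look like this; the round-robin assignment of connections is an assumption, just to show how a socket gets bound to one particular context.

```cpp
// Sketch: each thread owns its own io_context, so no reactor or handler
// queue is shared between threads.
#include <boost/asio.hpp>
#include <algorithm>
#include <memory>
#include <thread>
#include <vector>

int main() {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::unique_ptr<boost::asio::io_context>> contexts;
    std::vector<boost::asio::executor_work_guard<
        boost::asio::io_context::executor_type>> guards;
    for (unsigned i = 0; i < n; ++i) {
        contexts.push_back(std::make_unique<boost::asio::io_context>());
        guards.push_back(boost::asio::make_work_guard(*contexts.back()));
    }

    std::vector<std::thread> threads;
    for (auto& ctx : contexts)
        threads.emplace_back([p = ctx.get()] { p->run(); });

    // A new connection would get a socket bound to one context, e.g.
    // round robin: boost::asio::ip::tcp::socket sock(*contexts[next % n]);

    for (auto& g : guards) g.reset();   // let the run() calls finish
    for (auto& t : threads) t.join();
}
```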

Low-latency read of UDP port

I am reading a single data item from a UDP port. It's essential that this read be the lowest latency possible. At present I'm reading via the boost::asio library's async_receive_from method. Does anyone know the kind of latency I will experience between the packet arriving at the network card, and the callback method being invoked in my user code?
Boost is a very good library, but quite generic, is there a lower latency alternative?
All opinions on writing low-latency UDP network programs are very welcome.
EDIT: Another question, is there a relatively feasible way to estimate the latency that I'm experiencing between NIC and user mode?
Your latency will vary, but it will be far from the best you can get. Here are a few things that will stand in the way of better latency:
Boost.ASIO
It constantly allocates/deallocates memory to store "state" in order to invoke a callback function associated with your read operation.
It does unnecessary mutex locking/unlocking in order to support a broken mix of async and sync approaches.
Worst of all, it constantly adds and removes event descriptors from the underlying notification mechanism.
All in all, asio is a good library for high-level application developers, but it comes with a big price tag and a lot of CPU cycle eating gremlins. Another alternative is libevent, it is a lot better, but still aims to support many notification mechanisms and be platform-independent. Nothing can beat native mechanisms, i.e. epoll.
Other things
The kernel's UDP stack. It doesn't do a very good job for latency-sensitive applications. One of the most popular solutions is OpenOnload, which bypasses the kernel stack and works directly with your NIC.
The scheduler. By default, the scheduler is optimized for throughput, not latency. You will have to tweak and tune your OS to make it latency oriented. Linux, for example, has a lot of "rt" patches for that purpose.
Watch out not to sleep. Once your process is sleeping, you will never get a good wakeup latency compared to constantly burning CPU and waiting for a packet to arrive.
Interference with other IRQs, processes etc.
I cannot give you exact numbers, but assuming that you won't be getting a lot of traffic, using Boost and a regular Linux kernel on regular hardware, your latency will range somewhere between ~50 microseconds and ~100 milliseconds. It will improve a bit as you get more data, after some point start dropping, and it will always fluctuate. I'd say that if you are OK with those numbers, don't bother optimizing.
I think that by using recv() in a "spin" loop thread and pinning the thread to a single CPU core (processor affinity), the latency should be lower than with select(); in my tests the precision of select() varied from 1 to 10 microseconds, while the spin loop stayed at about 1 microsecond.
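Below is a hedged sketch of that spin-and-pin approach: a non-blocking recv() in a tight loop with the thread pinned to one core. The Linux-specific affinity call, the port, and the buffer size are all assumptions, and the loop deliberately burns a full core in exchange for avoiding scheduler wakeup latency.

```cpp
// Sketch: pin the thread to core 0, then poll a UDP socket in a tight loop.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>

int main() {
    // Pin the calling thread to core 0 (Linux-specific).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in local{};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(9000);                 // example port
    bind(fd, reinterpret_cast<sockaddr*>(&local), sizeof(local));

    char buf[2048];
    for (;;) {
        // MSG_DONTWAIT makes this a pure poll; spin until a packet arrives.
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
        if (n >= 0) {
            // ... handle the packet as fast as possible ...
        } else if (errno != EAGAIN && errno != EWOULDBLOCK) {
            break;                                // real error
        }
    }
    close(fd);
}
```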

Multithreading vs multiprocessing

I am new to this kind of programming and need your point of view.
I have to build an application but I can't get it to compute fast enough. I have already tried Intel TBB, and it is easy to use, but I have never used other libraries.
For multiprocessor programming, I have been reading about OpenMP and Boost for multithreading, but I don't know their pros and cons.
In C++, when is multi-threaded programming advantageous compared to multiprocessor programming, and vice versa? Which is best suited to heavy computations or to launching many tasks...? What are their pros and cons when we build an application designed with them? And finally, which library is best to work with?
Multithreading means exactly that, running multiple threads. This can be done on a uni-processor system, or on a multi-processor system.
On a single-processor system, when running multiple threads, the actual observation of the computer doing multiple things at the same time (i.e., multi-tasking) is an illusion, because what's really happening under the hood is that there is a software scheduler performing time-slicing on the single CPU. So only a single task is happening at any given time, but the scheduler is switching between tasks fast enough so that you never notice that there are multiple processes, threads, etc., contending for the same CPU resource.
On a multi-processor system, the need for time-slicing is reduced. The time-slicing effect is still there, because a modern OS could have hundreds of threads contending for two or more processors, and there is typically never a 1-to-1 relationship between the number of threads and the number of processing cores available. So at some point, a thread will have to stop and another thread starts on a CPU that the two threads are sharing. This is again handled by the OS's scheduler. That being said, with a multiprocessor system, you can have two things happening at the same time, unlike with the uni-processor system.
In the end, the two paradigms are really somewhat orthogonal in the sense that you will need multithreading whenever you want to have two or more tasks running asynchronously, but because of time-slicing, you do not necessarily need a multi-processor system to accomplish that. If you are trying to run multiple threads, and are doing a task that is highly parallel (i.e., trying to solve an integral), then yes, the more cores you can throw at a problem, the better. You won't necessarily need a 1-to-1 relationship between threads and processing cores, but at the same time, you don't want to spin off so many threads that you end up with tons of idle threads because they must wait to be scheduled on one of the available CPU cores. On the other hand, if your parallel task requires some sequential component, i.e., a thread will be waiting for the result from another thread before it can continue, then you may be able to run more threads with some type of barrier or synchronization method so that the threads that need to be idle are not spinning away using CPU time, and only the threads that need to run are contending for CPU resources.
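To make that last point concrete, here is a minimal sketch of a thread that waits on a condition variable for another thread's result instead of spinning, so it consumes no CPU until it is actually runnable again; the names and values are illustrative only.

```cpp
// Sketch: consumer blocks on a condition variable until the producer is done.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;
int produced_value = 0;

void producer() {
    {
        std::lock_guard<std::mutex> lock(m);
        produced_value = 42;            // stand-in for real work
        ready = true;
    }
    cv.notify_one();
}

void consumer() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });   // sleeps, does not burn CPU
    std::cout << "got " << produced_value << '\n';
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}
```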
There are a few important points that I believe should be added to the excellent answer by @Jason.
First, multithreading is not always an illusion, even on a single processor: there are operations that do not involve the processor. These are mainly I/O - disk, network, terminal etc. The basic form of such an operation is blocking, or synchronous, i.e. your program waits until the operation is completed and then proceeds. While waiting, the CPU is switched to another process/thread.
If you have anything you can do during that time (e.g. background computation while waiting for user input, serving another request, etc.), you basically have two options:
use asynchronous I/O: you call a non-blocking I/O operation, providing it with a callback function and telling it "call this function when you are done". The call returns immediately and the I/O operation continues in the background. You go on with the other stuff.
use multithreading: you have a dedicated thread for each kind of task. While one waits for the blocking I/O call, the other goes on.
Both approaches are demanding programming paradigms, and each has its pros and cons.
with async I/O, the program's logic is less obvious and more difficult to follow and debug. However, you avoid thread-safety issues.
with threads, the challenge is to write thread-safe programs. Thread-safety faults are nasty bugs that are quite difficult to reproduce. Over-use of locking can actually degrade performance instead of improving it.
(Coming to multi-processing.)
Multithreading became popular on Windows because manipulating processes is quite heavy on Windows (creating a process, context switching, etc.) as opposed to threads, which are much more lightweight (at least this was the case when I worked on Win2K).
On Linux/Unix, processes are much more lightweight. Also (AFAIK) threads on Linux are actually implemented internally as a kind of process, so there is no gain in context switching between threads vs. processes. With separate processes, however, you need to use some form of IPC (inter-process communication), such as shared memory, pipes, or message queues.
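As a small illustration of the process-based alternative, here is a hedged fork()/pipe() sketch; it is Linux/Unix-specific and only shows the mechanics of the IPC, not a full worker design.

```cpp
// Sketch: fork a worker process and pass its result back through a pipe,
// since the two processes do not share an address space.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                       // child: do some work, report back
        close(fds[0]);
        const char* msg = "result from child";
        write(fds[1], msg, strlen(msg) + 1);
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);                        // parent: read the child's result
    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof(buf));
    if (n > 0) std::printf("parent received: %s\n", buf);
    close(fds[0]);
    waitpid(pid, nullptr, 0);
}
```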
On a lighter note, look at the SQLite FAQ, which declares "Threads are evil"! :)
To answer the first question:
The best approach is to just use multithreading techniques in your code until you get to the point where even that doesn't give you enough benefit. Assume the OS will handle delegation to multiple processors if they're available.
If you actually are working on a problem where multithreading isn't enough, even with multiple processors (or if you're running on an OS that isn't using its multiple processors), then you can worry about discovering how to get more power. Which might mean spawning processes across a network to other machines.
I haven't used TBB, but I have used IPP and found it to be efficient and well-designed. Boost is portable.
Just wanted to mention that the Flow-Based Programming ( http://www.jpaulmorrison.com/fbp ) paradigm is a naturally multiprogramming/multiprocessing approach to application development. It provides a consistent application view from high level to low level. The Java and C# implementations take advantage of all the processors on your machine, but the older C++ implementation only uses one processor. However, it could fairly easily be extended to use BOOST (or pthreads, I assume) by the use of locking on connections. I had started converting it to use fibers, but I'm not sure if there's any point in continuing on this route. :-) Feedback would be appreciated. BTW The Java and C# implementations can even intercommunicate using sockets.

How many threads to create and when?

I have a networking Linux application which receives RTP streams from multiple destinations, does very simple packet modification and then forwards the streams to the final destination.
How do I decide how many threads I should have to process the data? I suppose, I cannot open a thread for each RTP stream as there could be thousands. Should I take into account the number of CPU cores? What else matters?
Thanks.
It is important to understand the purpose of using multiple threads on a server; many threads in a server serve to decrease latency rather than to increase speed. You don't make the CPU any faster by having more threads, but you make it more likely that a thread will always be available within a given period to handle a request.
Having a bunch of threads that just move data in parallel is a rather inefficient shotgun approach (creating one thread per request naturally just fails completely). Using the thread pool pattern can be a more effective, focused approach to decreasing latency.
Now, in the thread pool, you want to have at least as many threads as you have CPUs/cores. You can have more than this but the extra threads will again only decrease latency and not increase speed.
Think of the problem of organizing server threads as akin to organizing a line at a supermarket. Would you like to have a lot of cashiers who work more slowly, or one cashier who works super fast? The problem with the fast cashier isn't speed, but rather that one customer with a lot of groceries might still take up a lot of their time. The need for many threads comes from the possibility that a few requests will take a lot of time and block all your threads. By this reasoning, whether you benefit from many slower cashiers depends on whether your customers have similar amounts of groceries or wildly different amounts. Getting back to the basic model, what this means is that you have to play with your thread number to figure out what is optimal given the particular characteristics of your traffic, looking at the time taken to process each request.
Classically, the reasonable number of threads depends on the number of execution units, the ratio of IO to computation, and the available memory.
Number of Execution Units (XU)
That counts how many threads can be active at the same time. Depending on your computations that might or might not count stuff like hyperthreads -- mixed instruction workloads work better.
Ratio of IO to Computation (%IO)
If the threads never wait for IO but always compute (%IO = 0), using more threads than XUs only increases the overhead of memory pressure and context switching. If the threads always wait for IO and never compute (%IO = 1), then using a variant of poll() or select() might be a good idea.
For all other situations, XU / (1 - %IO) gives an approximation of how many threads are needed to fully use the available XUs.
Available Memory (Mem)
This is more of an upper limit. Each thread uses a certain amount of system resources (MemUse). Mem / MemUse gives you an approximation of how many threads can be supported by the system.
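Putting those factors together, a back-of-the-envelope sizing might look like the sketch below; the IO fraction and the per-thread memory figure are invented placeholders (measure your own), and the formula is only the approximation described above.

```cpp
// Sketch: threads ~ XU / (1 - %IO), capped by what memory allows.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    unsigned xu = std::max(1u, std::thread::hardware_concurrency());
    double io_fraction = 0.8;                              // assumed: 80% of time waiting on IO
    std::size_t mem_bytes = 2ull * 1024 * 1024 * 1024;     // assumed available memory
    std::size_t mem_per_thread = 8ull * 1024 * 1024;       // assumed stack + buffers per thread

    unsigned wanted  = static_cast<unsigned>(xu / (1.0 - io_fraction));
    unsigned mem_cap = static_cast<unsigned>(mem_bytes / mem_per_thread);
    unsigned threads = std::min(wanted, mem_cap);

    std::printf("suggested thread count: %u\n", threads);
}
```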
Other Factors
The performance of the whole system can still be constrained by other factors even if you can guess or (better) measure the numbers above. For example, there might be another service running on the system that uses some of the XUs and memory. Another issue is the overall available IO bandwidth (IOCap). If you need fewer computing resources per transferred byte than your XUs provide, you'll obviously need to care less about using them completely and more about increasing IO throughput.
For more about this latter problem, see this Google Talk about the Roofline Model.
I'd say, try using just ONE thread; it makes programming much easier. Although you'll need to use something like libevent to multiplex the connections, you won't have any unexpected synchronisation issues.
Once you've got a working single-threaded implementation, you can do performance testing and make a decision on whether a multi-threaded one is necessary.
Even if a multithreaded implementation is necessary, it may be easier to break it into several processes instead of threads (i.e. not sharing address space; either fork() or exec multiple copies of the process from a parent) if they don't have a lot of shared data.
You could also consider using something like Python's "Twisted" to make implementation easier (this is what it's designed for).
Really, there's probably not a good case for using threads over processes, but maybe there is in your case; it's difficult to say. It depends on how much data you need to share between the threads.
I would look into a thread pool for this application.
http://threadpool.sourceforge.net/
Allow the thread pool to manage your threads and the queue.
You can tweak the maximum and minimum number of threads used based on performance profiling later.
Listen to the people advising you to use libevent (or OS-specific utilities such as epoll/kqueue). In the case of many connections this is an absolute must because, as you said, creating threads will be an enormous performance hit, and select() also doesn't quite cut it.
Let your program decide. Add code to it that measures throughput and increases/decreases the number of threads dynamically to maximize it.
This way, your application will always perform well, regardless of the number of execution cores and other factors.
It is a good idea to avoid trying to create one (or even N) threads per client request. This approach is classically non-scalable and you will definitely run into problems with memory usage or context switching. You should look at using a thread pool approach instead and treat the incoming requests as tasks for any thread in the pool to handle. The scalability of this approach is then limited by the ideal number of threads in the pool, which is usually related to the number of CPU cores. You want each thread to use exactly 100% of the CPU on a single core, so in the ideal case you would have 1 thread per core; this reduces context switching to zero. Depending on the nature of the tasks, this might not be possible: maybe the threads have to wait for external data, or read from disk, or whatever, so you may find that the number of threads should be increased by some scaling factor.
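To make the thread-pool idea concrete, here is a minimal standard-C++ sketch of a fixed-size pool sized from the core count. It is a toy, not a production-ready pool (no futures, no exception handling), and the submitted task is just a placeholder for "handle one request".

```cpp
// Sketch: fixed set of worker threads pulling tasks from a shared queue.
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < std::max(1u, n); ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();   // drains remaining tasks first
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // e.g. handle one incoming request
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    ThreadPool pool;                         // sized from the core count
    pool.submit([] { /* process a request */ });
}                                            // destructor joins the workers
```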