Currently I'm not sure which design to choose. I'm trying to build a high-performance server, and I have a 6-core CPU, so with the "io_service per CPU" design I would have 6 io_services.
I've already heard that the thread-pool design isn't the best one, but I'm not sure about that.
What do you know about this? Has someone already run a stress test with each design, or anything similar?
In my experience it is vastly easier to approach asynchronous application design with the following order:
single thread and a single io_service
multiple threads each invoking io_service::run() on a single io_service. Use strands for handlers that require access to shared data structures.
io_service per cpu
The decision to move from one of these designs to the next should be made only after profiling your application. Note that the HTTP Server 2 example only shows how to use an io_service per CPU; it does not show you when or why to use such a design.
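For illustration, here is a minimal sketch of the second design: a single io_service drained by a pool of threads, with a strand serializing access to shared state. The pool size of 6 (matching the question's CPU) and the counter are only placeholders.

    #include <boost/asio.hpp>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main()
    {
        boost::asio::io_service io_service;
        boost::asio::io_service::strand strand(io_service); // serializes its handlers

        int shared_counter = 0; // shared state, touched only through the strand
        for (int i = 0; i < 100; ++i)
            strand.post([&shared_counter] { ++shared_counter; });

        // Design #2: several threads, one io_service, all invoking run().
        std::vector<std::thread> pool;
        for (int i = 0; i < 6; ++i)
            pool.emplace_back([&io_service] { io_service.run(); });
        for (auto& t : pool)
            t.join(); // run() returns once the queue is empty

        std::cout << shared_counter << '\n'; // always 100, no mutex needed
    }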
Another good way to approach this: start multiple copies of your process and bind each one to a different core using your OS's facilities (on FreeBSD, use cpuset). The OS is going to do a better job of this than any userland code will. Then use an external load balancer to distribute load across your server instances. Extra points for binding NIC interrupts to a particular CPU.
I'm having trouble managing the work .post()'ed to Boost.Asio's io_context, and have multiple questions about it (newbie warning).
Background: I'm writing a library that connects to a large number of different hosts for short periods at a time (connect, send data, receive answer, close), and I figured I'd use Boost.Asio. The documentation is scarce (too DRY?).
My current approach is this (assuming a quad-core machine): two threads on two physical cores run CPU-bound synchronous operations and post() additional work items to the io_context. Two other threads .run() the io_context and execute the completion handlers.
1- The work scheduler
As per this amazing answer,
Boost.Asio may start some of the work as soon as it has been told about it, and other times it may wait to do the work at a later point in time.
When does Boost.Asio do which? On what basis is the queued work processed later?
2- Multiple Producers / Multiple Consumers
As per this article,
At its core, Boost Asio provides a task execution framework that you can use to perform operations of any kind. You create your tasks as function objects and post them to a task queue maintained by Boost Asio. You enlist one or more threads to pick these tasks (function objects) and invoke them. The threads keep picking up tasks, one after the other till the task queues are empty at which point the threads do not block but exit.
I am failing to find a way to put a cap on the length of this task queue. This answer gives a couple of solutions, but both involve locking, which I'd like to avoid as much as possible.
3- Are strands really necessary? How do I "disable" them?
As detailed in this answer, Boost.Asio uses an implicit strand per connection. With potentially millions of connections, the memory saved by "bypassing" strands makes sense to me. The requests I make are independent (a different host for each request), and the operations within a single connection are already serialized by the callback chain, so I have no overlapping reads and writes and expect no synchronization from Boost.Asio. Does it make sense for me to try and bypass strands? If so, how?
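Not an authoritative answer, but a sketch of how the "implicit strand" case usually looks: if a connection only ever has one outstanding operation (a strict connect -> write -> read -> close chain) and each io_context is run by exactly one thread, its handlers can never overlap, so no explicit strand object is needed. The connection type, endpoint, and payload below are placeholders.

    #include <boost/asio.hpp>
    #include <array>
    #include <memory>

    using boost::asio::ip::tcp;

    struct connection : std::enable_shared_from_this<connection>
    {
        tcp::socket socket_;
        std::array<char, 4096> reply_;

        explicit connection(boost::asio::io_context& io) : socket_(io) {}

        void start(const tcp::endpoint& ep)
        {
            auto self = shared_from_this();
            socket_.async_connect(ep, [self](boost::system::error_code ec) {
                if (!ec) self->write(); // next link in the chain
            });
        }

        void write()
        {
            auto self = shared_from_this();
            boost::asio::async_write(socket_, boost::asio::buffer("ping", 4),
                [self](boost::system::error_code ec, std::size_t) {
                    if (!ec) self->read();
                });
        }

        void read()
        {
            auto self = shared_from_this();
            socket_.async_read_some(boost::asio::buffer(reply_),
                [self](boost::system::error_code, std::size_t) {
                    boost::system::error_code ignored;
                    self->socket_.close(ignored); // chain ends here
                });
        }
    };

    int main()
    {
        boost::asio::io_context io; // run by exactly one thread: implicit strand
        tcp::endpoint ep(boost::asio::ip::make_address("127.0.0.1"), 8080); // placeholder
        std::make_shared<connection>(io)->start(ep);
        io.run();
    }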
4- Scaling design approach (A bit vague because I have no clue)
As stated in my background section, I'm running two io_contexts on two physical cores, each with two threads, one for writing and one for reading. My goal here is to spew packets as fast as I can, and I have already:
Compiled asio with BoringSSL (OpenSSL is a serious bottleneck)
Written my own c-ares resolver service, to avoid the async-ish DNS queries that run in a thread loop.
But it still happens that my network driver starts timing out when many connections are open. So how do I dynamically adjust Boost.Asio's throughput so that the network adapter can cope with it?
My questions are most likely ill-informed, as I'm no expert in network programming, and I know this is a complex problem. I'd appreciate it if someone left pointers for me to look at before closing the question or marking it "dead".
Thank you.
I am attempting to rewrite my current project to include more features and stability, and I need some help designing it. Here is the gist of it (for Linux):
TCP_SERVER receives connection (auth packet)
TCP_SERVER starts a new (thread/fork) to handle the new client
TCP_SERVER will be receiving many packets from the client, which will be added to a circular buffer
A separate thread will be created for that client to process those packets and build a list of objects
Another thread should be created to send parts of the list of objects to another client
The reason for separating all the processing into threads is that the server will be getting many packets and the processing won't otherwise be able to keep up (and it needs to be quick, as it's time-sensitive; I'm not sure whether TCP will drop packets if the internal buffer gets too large?). A further thread sends to the other client, to keep the processing as fast as possible.
So for each new connection, 3 threads would be created: one to receive packets, one to process them, and one to send the processed data to another client (which is technically the same person/IP, just on a different device).
I need help designing this: how to structure it, what to use (forks/threads), and which libraries to use.
Trying to do this yourself is going to cause you a world of pain. Focus on your actual application, and leverage an existing socket handling framework. For example, you said:
for each new connection, 3 threads should be created
That statement says the following:
1. You haven't done this before, at scale, and haven't realized the impact all these threads will have.
2. You've never benchmarked thread creation or synchronous operations.
3. The number of things that can go wrong with this approach is pretty overwhelming.
Give some serious thought to using an existing library that does most of this for you. Getting the scaffolding right around this can literally take years, and you're better off focusing on your code rather than all the random plumbing.
The Boost C++ libraries seem to have a nice Async C++ socket handling infrastructure. Combine this with some of the existing C++ thread pools and you could likely have a highly performant solution up fairly quickly.
I would also question your use of C++ for this. Java and C# both do highly scalable socket servers pretty well, and some of the higher-level language tooling (Spring, Guava, etc.) can be very, very valuable. If you ever want to secure this, via TLS or another mechanism, you'll also probably find it much easier in Java or C# than in C++.
Some of the major things you'll care about:
1. True Async I/O will be a huge perf and scalability win. Try really hard to do this. The boost asio library looks pretty nice.
2. Focus on your features and stability, rather than building a new socket handling platform.
3. Threads are expensive, avoid creating them. Thread pools are your friend.
You plan to create one or more threads for every connection your server handles. Threads are not free: they come with memory and CPU overhead, and when you have many active threads you also begin to have resource contention.
What usage pattern do you anticipate? Do you expect that when you have 8 connections, all 8 network threads will be consuming 100% of a cpu core pushing/pulling packets? Or do you expect them to have a relatively low turn-around?
As you add more threads, you will begin to have to spend more time competing for resources in things like mutexes etc.
A better pattern is to have one or more threads for network I/O. Most OSes have mechanisms for saying "tell me when one or more of these network connections has I/O", which is an efficiency saving over lots of individual threads each doing the same thing for just one connection.
Then, for the actual processing, spin up a pool of worker threads, allowing you to minimize the competition for resources. You can monitor the workload to determine whether you need to spin up more threads to meet delivery requirements.
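A rough sketch of that shape, assuming Boost.Asio (libevent or raw epoll would work equally well): one context owns all the network I/O, a second context serves as the work queue for the pool, and process() stands in for your (hypothetical) packet handler.

    #include <boost/asio.hpp>
    #include <thread>
    #include <vector>

    int main()
    {
        boost::asio::io_context net_io;  // one thread: socket readiness, reads, writes
        boost::asio::io_context workers; // worker pool: CPU-heavy processing

        auto net_guard    = boost::asio::make_work_guard(net_io);
        auto worker_guard = boost::asio::make_work_guard(workers);

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
            pool.emplace_back([&workers] { workers.run(); });
        std::thread net_thread([&net_io] { net_io.run(); });

        // Inside a read handler, a finished packet would be handed off like:
        //   boost::asio::post(workers, [pkt = std::move(pkt)] { process(pkt); });

        // ... accept connections and start async reads on net_io ...

        net_guard.reset(); worker_guard.reset(); // let run() return when drained
        net_thread.join();
        for (auto& t : pool) t.join();
    }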
You might also want to look at something that implements the network I/O infrastructure for you. I've had really good performance results with libevent, but then I've only had to deal with very high-performance/reliability networking systems.
I am implementing a custom server that needs to maintain a very large number (100K or more) of long-lived connections. The server simply passes messages between sockets and doesn't do any serious data processing. Messages are small, but many of them are received/sent every second. Reducing latency is one of the goals. I realize that using multiple cores won't improve performance, and I have therefore decided to run the server in a single thread by calling the run_one or poll methods of the io_service object. In any case, a multi-threaded server would be much harder to implement.
What are the possible bottlenecks? Syscalls, bandwidth, the completion queue / event demultiplexing? I suspect that dispatching handlers may require locking (done internally by the asio library). Is it possible to disable event queue locking (or any other locking) in Boost.Asio?
EDIT: a related question. Does syscall performance improve with multiple threads? My feeling is that, because syscalls are atomic/synchronized by the kernel, adding more threads won't improve speed.
You might want to read my question from a few years ago, I asked it when first investigating the scalability of Boost.Asio while developing the system software for the Blue Gene/Q supercomputer.
Scaling to 100k or more connections should not be a problem, though you will need to be aware of the obvious resource limitations such as the maximum number of open file descriptors. If you haven't read the seminal C10K paper, I suggest reading it.
After you have implemented your application using a single thread and a single io_service, I suggest investigating a pool of threads invoking io_service::run(), and only then investigating pinning an io_service to a specific thread and/or CPU. There are multiple examples included in the Asio documentation for all three of these designs, and several questions on SO with more information. Be aware that as you introduce multiple threads invoking io_service::run(), you may need to use strands to ensure the handlers have exclusive access to shared data structures.
Using boost::asio you can write a single-threaded or a multi-threaded server at approximately the same development cost. You can write the single-threaded version first, then convert it to a multithreaded one if needed.
Typically, the only bottleneck in boost::asio is that the epoll/kqueue reactor runs inside a mutex, so only one thread is calling epoll at any given time. This can decrease performance when you have a multithreaded server that serves lots and lots of very small packets. But, IMO, it should still be faster than a plain single-threaded server.
Now about your task. If you just want to pass messages between connections, I think it should be a multithreaded server. The problem is the syscalls (recv/send etc.). An instruction is very cheap for the CPU, but a syscall is not a very "light" operation (everything is relative, but relative to the other jobs in your task). So with a single thread you will get a large syscall overhead, which is why I recommend the multithreaded scheme.
Also, you can use separate io_services with the "io_service per thread" idiom. I think this should give the best performance, but it has a drawback: if one io_service gets too big a queue, the other threads will not help it, so some connections may slow down. On the other hand, with a single io_service, a queue overrun can lead to a big locking overhead. The best you can do is implement both variants and measure bandwidth/latency; it should not be too difficult to implement both.
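A rough sketch of that "io_service per thread" idiom, loosely modeled on the io_service_pool class from the Boost HTTP Server 2 example (round-robin assignment is just one simple policy):

    #include <boost/asio.hpp>
    #include <atomic>
    #include <memory>
    #include <thread>
    #include <vector>

    class io_service_pool
    {
    public:
        explicit io_service_pool(std::size_t n)
        {
            for (std::size_t i = 0; i < n; ++i)
            {
                services_.push_back(std::make_unique<boost::asio::io_service>());
                work_.emplace_back(*services_.back()); // keeps each run() alive
            }
        }

        // Each io_service is drained by exactly one thread, so a connection
        // assigned to it never needs a strand.
        void run()
        {
            std::vector<std::thread> threads;
            for (auto& s : services_)
                threads.emplace_back([&s] { s->run(); });
            for (auto& t : threads)
                t.join();
        }

        // Round-robin: everything for one connection stays on one io_service.
        boost::asio::io_service& get()
        {
            return *services_[next_++ % services_.size()];
        }

    private:
        std::vector<std::unique_ptr<boost::asio::io_service>> services_;
        std::vector<boost::asio::io_service::work> work_;
        std::atomic<std::size_t> next_{0};
    };

Each connection then does all of its work on the io_service returned by get(), which is what avoids cross-thread locking, at the cost of the load imbalance between queues described above.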
I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, since I am at it, I would like to do it properly, as opposed to just getting-it-done.
If I understood correctly, Apache uses a thread-per-connection architecture, while nginx uses an event loop and then dedicates a worker [x] to the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure whether it applies to my situation. The analogy also seems very idealistic; I have rarely seen my computer run at 100% CPU (even with an umptillion Chrome tabs open and Photoshop and what-not running simultaneously).
Also, I came across an SO post (it has somehow vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. That question was about the JVM, though.
So, let's estimate a fictional user base of around 5,000 users. Which approach would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread pool (approximately how big do you suggest the thread pool should be?)
Creating a thread per connection and then destroying the thread when the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.
Reactive applications certainly scale better, when they are written correctly. This means:
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of your server. You typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexes, since these can block, so no shared mutable state. If you require shared state, you will have to wrap it with an actor or similar so that only one thread has access to the state.
All work in the reactive threads should be CPU-bound
All IO has to be asynchronous or be performed in a different thread pool and the results feed back into the reactor.
This means using either futures or callbacks to process replies; this style of code can quickly become unmaintainable if you are not used to it and disciplined.
All work in the reactive threads should be small
To maintain the responsiveness of the server, all tasks in the reactor must be small (bounded in time).
On an 8-core machine you cannot allow 8 long tasks to arrive at the same time, because no other work will start until they are complete.
If a task could take a long time, it must be broken up (cooperative multitasking); see the sketch below.
Tasks in reactive applications are scheduled by the application, not the operating system; that is why they can be faster and use less memory. When you write a reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
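To make the "break it up" point concrete, here is a small illustrative sketch (the batch size and the per-item work are placeholders): instead of looping over everything in one handler, process a bounded slice and re-post the continuation, so other handlers can interleave.

    #include <boost/asio.hpp>
    #include <algorithm>
    #include <cstddef>

    // Processes [begin, end) in small, time-bounded slices. Between slices,
    // control returns to the reactor so other tasks can run.
    void process_range(boost::asio::io_context& io, std::size_t begin, std::size_t end)
    {
        const std::size_t batch = 1024; // placeholder slice size
        std::size_t stop = std::min(begin + batch, end);
        for (std::size_t i = begin; i < stop; ++i)
        {
            // do_item(i); // placeholder for the real per-item computation
        }
        if (stop < end)
            boost::asio::post(io, [&io, stop, end] { process_range(io, stop, end); });
    }

    int main()
    {
        boost::asio::io_context io;
        process_range(io, 0, 1000000); // first slice runs here, rest are posted
        io.run(); // cooperative: slices interleave with any other posted work
    }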
I am a big fan of reactive architectures, but they come with costs. I am not sure I would write my first C++ application as a reactive one; I normally try to learn one thing at a time.
If you decide to use a reactive architecture, use a good framework that will help you design and structure your code, or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? Can it only come in from an external event (e.g. a network request)?
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results? (See the sketch below.)
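For that last item, one common shape in Boost.Asio terms (the names here are hypothetical) is to post the blocking call onto a separate pool and post its result back into the reactor:

    #include <boost/asio.hpp>
    #include <string>

    // Runs a blocking call on 'blocking_pool', then resumes on 'reactor'.
    void handle_request(boost::asio::io_context& reactor,
                        boost::asio::thread_pool& blocking_pool)
    {
        boost::asio::post(blocking_pool, [&reactor] {
            std::string result = "..."; // placeholder for a blocking call (disk, DB)
            boost::asio::post(reactor, [result] {
                // Back on a reactor thread: consume 'result' without ever
                // having blocked the reactor itself.
            });
        });
    }

    int main()
    {
        boost::asio::io_context reactor;
        boost::asio::thread_pool blocking_pool(4); // pool size is arbitrary here
        handle_request(reactor, blocking_pool);
        blocking_pool.join(); // wait for the blocking task to finish posting
        reactor.run();        // now executes the continuation
    }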
I cannot recommend a C++ library for this; I now do my server development in Scala and Akka, which provide all of this along with an excellent composable futures library to keep the code clean.
Best of luck learning C++, and with whichever choice you make.
Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost.Thread plus Boost.Asio. You could also try the C++11 std thread library and std mutex (but Boost.Asio is better than mutexes in a lot of cases: just always call back to the same thread and you don't need protected regions). Stay away from std future, because it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core: 8 cores -> 8 threads, plus maybe a few extra if you think it's possible that your thread-pool threads might sometimes call blocking operations.
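As a sketch of that sizing rule (the fallback of 8 is an arbitrary guess; hardware_concurrency() may return 0 on some platforms):

    #include <boost/asio.hpp>
    #include <thread>
    #include <vector>

    int main()
    {
        boost::asio::io_service io;

        unsigned n = std::thread::hardware_concurrency(); // logical cores
        if (n == 0) n = 8; // fallback when the core count is unknown

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; ++i)
            pool.emplace_back([&io] { io.run(); });

        // ... post work to io; each thread's run() returns once io is idle ...

        for (auto& t : pool)
            t.join();
    }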
FWIW, Poco has supported option 2 (ParallelReactor) since version 1.5.1.
I think that option 2 is the best one. As for tuning the pool size, I think the pool should be adaptive: able to spawn more threads (up to some hard limit) and remove excess threads in times of low activity.
As the analogy you linked to (and its comments) suggests, this is somewhat application-dependent. Now, what you are building here is a game server; let's analyze that.
Game servers (generally) do a lot of I/O and relatively few calculations, so they are far from 100%-CPU applications.
On the other hand, they also usually change values in some database (a "game world" model). All players create reads and writes to this database, which is exactly the intersection problem in the analogy.
So while you may gain something from handling the I/O in separate threads, you will also lose from having separate threads accessing the same database and waiting on its locks.
So either option 1 or option 2 is acceptable in your situation. For scalability reasons I would not recommend option 3.
It seems like all the examples always show the same io_service being run in all threads.
Can you start multiple io_services? Here is what I would like to do:
Start io_service A in the main thread for handling user input...
Start another io_service B in another thread that then starts a bunch of worker threads, all sharing io_service B.
Users on io_service A can "post" work on io_service B so that it gets done on the worker pool, with no work being done on io_service A, i.e. the main thread.
Is this possible? Does this make sense?
Thanks
In my experience, it really depends on the application whether an io_service per CPU or one per process performs better. There was a discussion on the asio-users mailing list a few years ago on this very topic.
The Boost.Asio documentation has some great examples showing these two techniques in the HTTP Server 2 and HTTP Server 3 examples. But keep in mind that the second HTTP server just shows how to use the technique, not when or why to use it. Those questions need to be answered by profiling your application.
In general, you should use the following order when creating applications using Boost.Asio:
Single threaded
Thread pool with a single io_service
Multiple io_service objects with some sort of CPU affinity
Good question!
Yes, it is possible, for one. In an application I'm currently working on, I have broken the application up into separate components responsible for different aspects of the system. Each component runs in its own thread, has its own set of timers, and does its own network I/O using asio. From a testability/design perspective this seems cleaner to me, since no component can interfere with another, but I stand to be corrected. I suppose I could rewrite everything to pass the io_service in as a parameter, but so far I haven't found the need to do so.
So, coming back to your question: you can do whatever you want. IMO it's more a case of trying it out and changing it if you run into any issues.
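For what it's worth, a minimal sketch of the layout from the question (the worker count of 4 and the printed message are placeholders): io_service A runs in the main thread, io_service B is drained by a worker pool, and a handler on A posts work over to B.

    #include <boost/asio.hpp>
    #include <iostream>
    #include <memory>
    #include <thread>
    #include <vector>

    int main()
    {
        boost::asio::io_service io_a; // main thread: user input
        boost::asio::io_service io_b; // worker pool
        auto keep_b_alive = std::make_unique<boost::asio::io_service::work>(io_b);

        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([&io_b] { io_b.run(); });

        // A handler running on A hands heavy work to B; nothing heavy ever
        // runs on the main thread.
        io_a.post([&io_b] {
            io_b.post([] { std::cout << "done on the worker pool\n"; });
        });

        io_a.run();           // main thread drains A, returns when A is idle
        keep_b_alive.reset(); // let each worker's run() return once B drains
        for (auto& t : workers)
            t.join();
    }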
Also, you might want to take a look at what Sam Miller pointed out in a different post WRT handling user input ( that is if you're using a console): https://stackoverflow.com/questions/5210796/boost-asio-how-to-write-console-server