boost::asio multi-threading problem - c++

I've got a server that uses boost::asio, which I wish to make multi-threaded.
The server can be broken down into several "areas", with the sockets starting in a connect area, then once connected to a client being moved to an authentication area (i.e. login or register) before then moving between various other parts of the server depending on what the client is doing.
I don't particularly want to just use a thread pool on a single io_service for all the sockets, since a large number of locks would be required, especially in areas with a large amount of interaction with common resources. Instead, I want to give each server component (say, authentication) its own thread.
However, I'm not sure how to do this. I considered giving each component its own io_service, so it could use whatever threads it wanted; however, sockets are tied to an io_service, and I'm not sure how to then move a client's socket from one component to another.

You can solve this with asio::io_service::strand. Create a thread pool for the io_service as usual. Once you've established a connection with the client, wrap all async calls from there on with an io_service::strand, one strand per client. Handlers wrapped by the same strand never run concurrently, so from the client's point of view the server is single threaded.

First, I'd advocate considering the multi-process approach instead; it is a very straightforward architecture that is easy to reason about, easy to debug, and easy to scale.
A server design where you can scale horizontally, with several instances of the server where state within each does not need to be shared between servers, is more easily scalable. Shared state can live in a common database: SQL, Voldemort (persistent), Redis (sets and lists - very cool, I'm really excited about a persistent version), memcached (unreliable) or such.
You could, for example, have a single listener thread that balances between several server processes using UNIX sendmsg() to transfer the descriptor. This architecture would be straightforward to migrate to multi machine with hardware load-balancers later.
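To make the descriptor hand-off concrete, here is a minimal sketch of passing an open descriptor over a UNIX-domain socket with sendmsg() and SCM_RIGHTS. The helper names send_fd, recv_fd and demo_descriptor_passing are made up for illustration, and the demo passes a pipe end inside one process rather than a client socket between a listener and a server process; error handling is reduced to the bare minimum.

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <cassert>
#include <cstring>

// Send the open descriptor `fd` across the UNIX-domain socket `channel`.
bool send_fd(int channel, int fd) {
    char byte = 'F';                       // one byte of ordinary data accompanies the control message
    iovec iov{&byte, 1};

    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))];
    std::memset(control, 0, sizeof(control));

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;          // marks the payload as file descriptors
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(channel, &msg, 0) == 1;
}

// Receive a descriptor sent by send_fd(); returns -1 on failure.
int recv_fd(int channel) {
    char byte = 0;
    iovec iov{&byte, 1};

    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))];
    std::memset(control, 0, sizeof(control));

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    if (recvmsg(channel, &msg, 0) <= 0) return -1;

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

// Demo: hand the write end of a pipe across a socketpair, write one byte
// through the received descriptor, and return what arrives at the read end.
char demo_descriptor_passing() {
    int channel[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, channel);
    assert(rc == 0);
    int pipe_fds[2];
    rc = pipe(pipe_fds);
    assert(rc == 0);
    (void)rc;

    bool sent = send_fd(channel[0], pipe_fds[1]);
    int received = sent ? recv_fd(channel[1]) : -1;

    char out = 0;
    if (received >= 0 && write(received, "x", 1) == 1) {
        if (read(pipe_fds[0], &out, 1) != 1) out = 0;
    }

    if (received >= 0) close(received);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    close(channel[0]);
    close(channel[1]);
    return out;
}
```

In a real deployment the two ends of the channel would live in different processes, and the descriptor sent would be the accepted client socket.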
The area idea in the question is intriguing. It could be that, rather than locking, you could do it all with message queues. The reasoning is that disk IO - even with SSD and such - and the network are the real bottlenecks, so it is not necessary to be as careful with CPU; the latency of messages passing between threads is not such a big deal, and depending on your operating system the threads (or processes) could be scheduled to different cores in an SMP setup.
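A rough sketch of the message-queue idea, using only the standard library: each area gets an inbox and a single thread that owns the area's state, so the state itself needs no lock (the queue still uses a small internal mutex). MessageQueue and run_auth_area are hypothetical names, and a login name as a plain string stands in for whatever message a real server would pass.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// A small blocking queue: producers push, the owning area thread pops.
template <typename T>
class MessageQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        ready_.notify_one();
    }

    // No more messages will arrive; wakes the consumer so it can finish.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a message is available. Returns false once the queue
    // has been closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// Hypothetical "authentication area": only the area thread touches
// `authenticated`, so that container is never locked.
std::vector<std::string> run_auth_area(const std::vector<std::string>& logins) {
    MessageQueue<std::string> inbox;
    std::vector<std::string> authenticated;

    std::thread area([&] {
        std::string name;
        while (inbox.pop(name)) authenticated.push_back(name);
    });

    for (const auto& name : logins) inbox.push(name);
    inbox.close();
    area.join();                 // after join, reading `authenticated` is safe
    return authenticated;
}
```

Moving a client from one area to another then amounts to pushing a message that carries the client's socket onto the next area's inbox.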
But ultimately, once you reach saturation, to scale up the area idea you need faster cores and not more of them. Here's an interesting monologue from one of our hosts about that.

Related

Does this code satisfy the concurrency requirement?

I am following this code of a C++ http server. One of the requirements is concurrency. That seems to be taken care of by the following chunk of code:
    if(true) {
        // pthread_create() returns 0 on success and a positive error number on failure
        if(pthread_create(&thread, 0, handle_request, pcliefd) != 0) {
            perror("pthread_create()");
        }
    } else {
        handle_request(pcliefd);
    }
I then came across simpler code in this article. pthread is not used here. The response is handled by a write nested inside while(1). I suppose this simpler code does not meet the concurrency requirement? Anyway, what is the point of using threads to handle concurrency if the response is so simple? Is there something bigger behind this requirement?
The goal of your first linked question was to demonstrate a minimum of concurrency. Not a useful amount, just the bare minimum. Your second link doesn't even have that, as you correctly assumed.
Real webservers will be more complex. For starters, you don't want too much concurrency; there are only a limited number of CPU cores in your computer. See std::thread::hardware_concurrency.
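As a small illustration, a server might size its thread count from std::thread::hardware_concurrency() like this. The function name worker_count and the fallback of 2 are arbitrary choices for the sketch; the standard allows hardware_concurrency() to return 0 when the value cannot be determined.

```cpp
#include <cassert>
#include <thread>

// Number of worker threads to start: one per hardware thread, with a
// fallback when the standard library cannot determine the count.
unsigned worker_count() {
    unsigned cores = std::thread::hardware_concurrency();  // 0 means "unknown"
    unsigned count = (cores == 0) ? 2 : cores;              // 2 is an arbitrary fallback
    assert(count >= 1);                                     // postcondition
    return count;
}
```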
Anyways, what is the point of using thread to handle concurrency if the response is so simple?
This is actually a good question. The problem you face when you want to handle a large number of clients is that the read() and write() system calls are usually blocking. That means they block your current thread for as long as they take to complete the requested operation.
Say you have two clients that send a request to your single-threaded, non-concurrent server. Client A belongs to some lonely guy in a mountain hut with a really slow internet connection. Your accept() call returns and your program calls the handler routine for client A. Now, while the bits slowly trickle through the mountain cable and your handler routine waits for the request to be transmitted, a second client B connects to your server. This one belongs to a businessman with high-speed office internet access.
The problem here is that even if your response is so simple, the high-speed client still has to wait until your handler routine returns before its request can be processed. One slow client can slow down all the other clients, which is obviously not what you want.
You can solve that problem using two approaches:
1. (That is the attempt in your code.) You create a new thread for each client. That way, if a slow client is blocking the handling routine for a long time, the other clients are still able to proceed with their requests. The problem here is that a large number of clients creates a large number of threads. Context switching between thousands of threads can be a massive performance issue. So for a small number of concurrent clients this is fine, but for large-scale, high-performance servers we need something better.
2. You use a non-blocking API of the operating system. How exactly that works differs between operating systems, and even on a single OS several such APIs may exist. Usually you want to use a platform-independent library if you need this type of concurrency support. An excellent library here is Boost Asio.
The two approaches can be mixed. For the best performance you would want to have as many threads as you have processor cores. Each thread handles requests concurrently using an asynchronous (non-blocking) API. This is usually done with a worker pool and a task queue.
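The worker pool and task queue mentioned above could look roughly like the following standard-library sketch. WorkerPool and handle_requests are illustrative names, not part of any library, and the "requests" are just counter increments; a real server would submit handlers that do non-blocking I/O.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// A fixed set of threads pulling tasks from one shared queue.
class WorkerPool {
public:
    explicit WorkerPool(unsigned threads) {
        for (unsigned i = 0; i < threads; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Lets the workers finish whatever is still queued, then joins them.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& worker : workers_) worker.join();
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;      // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();                              // run outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Submits `count` trivial "requests" and returns how many were handled.
int handle_requests(int count) {
    std::atomic<int> handled{0};
    {
        unsigned cores = std::thread::hardware_concurrency();
        WorkerPool pool(cores ? cores : 2);      // one thread per core, fallback of 2
        for (int i = 0; i < count; ++i)
            pool.submit([&handled] { ++handled; });
    }                                            // pool destructor drains the queue
    return handled.load();
}
```

The thread count stays fixed no matter how many clients connect, which is the point of the pattern: a slow client occupies a queue slot rather than a thread of its own.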

How to limit Boost.Asio memory

I'm having trouble managing the work .post()'ed to Boost.Asio's io_context, having multiple questions about it (newbie warning).
Background: I'm writing a library that connects to a large number of different hosts for a short period each (connect, send data, receive answer, close), and I figured I'd use Boost.Asio. The documentation is scarce (too DRY?).
My current approach is this (assuming a quad-core machine): two physical cores run CPU-bound sync operations and post() additional work items to io_context. Two other threads are .run()ing and performing completion handlers.
1- The work scheduler
As per this amazing answer,
Boost.Asio may start some of the work as soon as it has been told about it, and other times it may wait to do the work at a later point in time.
When does boost.asio do what? On what basis is the queued work later processed?
2- Multiple Producers/ Multiple Consumers
As per This article,
At its core, Boost Asio provides a task execution framework that you can use to perform operations of any kind. You create your tasks as function objects and post them to a task queue maintained by Boost Asio. You enlist one or more threads to pick these tasks (function objects) and invoke them. The threads keep picking up tasks, one after the other till the task queues are empty at which point the threads do not block but exit.
I am failing to find a way to put a cap on the length of this task queue. This answer gives a couple of solutions, but they both involve locking, something I'd like to avoid as much as possible.
3- Are strands really necessary? How do I "disable them"
As detailed in this answer, boost uses an implicit strand per connection. Since I'm potentially making millions of connections, the memory savings from "bypassing" strands make sense to me. The requests I make are independent (a different host for each request) and the operations I make within a single connection are already serialized (callback chain), so I have no overlapping reads and writes, and no synchronization is expected from Boost.Asio. Does it make sense for me to try and bypass strands? If so, how?
4- Scaling design approach (A bit vague because I have no clue)
As stated in my background section, I'm running two io_contexts on two physical cores, each with two threads, one for writing and one for reading. My goal here is to spew packets as fast as I can, and I have already:
Compiled asio with BoringSSL (OpenSSL is a serious bottleneck)
Written my own c-ares resolver service to avoid async-ish DNS queries running in a thread loop.
But it still happens that my network driver starts timing out when multiple connections are opened. So how do I dynamically adjust boost.asio's throughput so that the network adapter can cope with it?
My question(s) are most likely ill-informed, as I'm no expert in network programming, and I know this is a complex problem. I'd appreciate it if someone left pointers for me to look at before closing the question or making it "dead".
Thank you.

IOCP Critical Section Design

I'm running a fully operational IOCP TCP socket application. Today I was thinking about the Critical Section design and now I have one endless question in my head: global or per-client Critical Section? I came to this because, as I see it, there is no point in using multiple worker threads if every thread depends on a single lock, right? I mean... right now I don't see any performance issue with 100 simultaneous clients, but what if there were 10000?
My shared resource is a per-client pre-allocated struct, so each client has its own IO context, socket and stuff. There is no inter-client resource sharing, so I think that is another point in favour of the per-client CS. I use one accept thread and 8 (processors * 2) worker threads. This application is basically designed for small (< 1KB) packets, but sometimes for file streaming.
The "correct" answer probably depends on your design, the number of concurrent clients and the performance that you require from the hardware that you have available.
In general, I find it best to go with the simplest thing that works and then profile to locate hot spots.
However... You say that you have no inter-client shared resources so I assume the only synchronisation that you need to do is around 'per-connection' state.
Since it's per connection the obvious (to me) design would be for the per-connection state to contain its own critical section. What do you perceive to be the downside of this approach?
The problem with a single shared lock is that you introduce contention between connections (and threads) that have no reason to block each other. This will adversely affect performance and will likely become a hot-spot as connection numbers rise.
Once you have a per connection lock you might want to look at avoiding using it as often as possible by having the IOCP threads simply lock to place completions in a per connection queue for processing. This has the advantage of allowing a single IOCP thread to work on each connection and preventing a single connection from having additional IOCP threads blocking on it. It also works well with 'skip completion port on success' processing.
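A portable sketch of that per-connection queue, with std::mutex standing in for a Windows CRITICAL_SECTION and ordinary threads standing in for IOCP threads. Connection, on_completion and simulate are hypothetical names, and an int stands in for a real completion record. The lock is held only while touching the queue; whichever thread finds the connection idle becomes its sole processor until the queue is empty.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Connection {
    std::mutex lock;             // per connection, never shared with other connections
    std::deque<int> pending;     // queued completions
    bool draining = false;       // true while some thread is processing this connection
    long processed_total = 0;    // touched only by the thread that is draining
};

// Called by any I/O thread when a completion for `conn` arrives.
void on_completion(Connection& conn, int completion) {
    {
        std::lock_guard<std::mutex> guard(conn.lock);
        conn.pending.push_back(completion);
        if (conn.draining) return;       // someone else is on it; do not block here
        conn.draining = true;            // this thread becomes the processor
    }
    for (;;) {
        int next;
        {
            std::lock_guard<std::mutex> guard(conn.lock);
            if (conn.pending.empty()) {
                conn.draining = false;   // hand the connection back
                return;
            }
            next = conn.pending.front();
            conn.pending.pop_front();
        }
        conn.processed_total += next;    // the actual work happens outside the lock
    }
}

// Several threads deliver completions for one connection at the same time.
long simulate(int threads, int per_thread) {
    Connection conn;
    std::vector<std::thread> io_threads;
    for (int t = 0; t < threads; ++t)
        io_threads.emplace_back([&conn, per_thread] {
            for (int i = 0; i < per_thread; ++i) on_completion(conn, 1);
        });
    for (auto& thread : io_threads) thread.join();
    return conn.processed_total;
}
```

Threads that arrive while the connection is busy return immediately after queueing, so they are free to service other connections instead of waiting on this one.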

Designing a multi-client tcp server to process data

I am attempting to rewrite my current project to include more features and stability, and need some help designing it. Here is the gist of it (for Linux):
TCP_SERVER receives connection (auth packet)
TCP_SERVER starts a new (thread/fork) to handle the new client
TCP_SERVER will be receiving many packets from client > which will be added to a circular buffer
A separate thread will be created for that client to process those packets and build a list of objects
Another thread should be created to send parts of the list of objects to another client
The reason for separating all the processing into threads is that the server will be getting many packets and the processing won't be able to keep up (it needs to be quick, as it's time sensitive; I'm not sure if TCP will drop packets if the internal buffer gets too large?), and another thread sends to the other client to keep the processing as fast as possible.
So for each new connection, 3 threads should be created: 1 to receive packets, 1 to process them, and 1 to send the processed data to another client (which is technically the same person/IP, just on a different device).
I need help designing this: how to structure it, what to use (forks/threads), and what libraries to use.
Trying to do this yourself is going to cause you a world of pain. Focus on your actual application, and leverage an existing socket handling framework. For example, you said:
for each new connection, 3 threads should be created
That statement says the following:
1. You haven't done this before, at scale, and haven't realized the impact all these threads will have.
2. You've never benchmarked thread creation or synchronous operations.
3. The number of things that can go wrong with this approach is pretty overwhelming.
Give some serious thought to using an existing library that does most of this for you. Getting the scaffolding right around this can literally take years, and you're better off focusing on your code rather than all the random plumbing.
The Boost C++ libraries seem to have a nice Async C++ socket handling infrastructure. Combine this with some of the existing C++ thread pools and you could likely have a highly performant solution up fairly quickly.
I would also question your use of C++ for this. Java and C# both do highly scalable socket servers pretty well, and some of the higher level language tooling (Spring, Guava, etc.) can be very, very valuable. If you ever want to secure this, via TLS or another mechanism, you'll also probably find this much easier in Java or C# than in C++.
Some of the major things you'll care about:
1. True Async I/O will be a huge perf and scalability win. Try really hard to do this. The boost asio library looks pretty nice.
2. Focus on your features and stability, rather than building a new socket handling platform.
3. Threads are expensive, avoid creating them. Thread pools are your friend.
You plan to create one-or-more threads for every connection your server handles. Threads are not free, they come with a memory and CPU overhead, and when you have many active threads you also begin to have resource contention.
What usage pattern do you anticipate? Do you expect that when you have 8 connections, all 8 network threads will be consuming 100% of a cpu core pushing/pulling packets? Or do you expect them to have a relatively low turn-around?
As you add more threads, you will begin to have to spend more time competing for resources in things like mutexes etc.
A better pattern is to have one or more threads for network IO. Most OSes have mechanisms for saying "tell me when one or more of these network connections has IO", which is more efficient than having lots of individual threads all doing the same thing for just one connection each.
Then for actual processing, spin up a pool of worker threads to do actual work, allowing you to minimize the competition for resources. You can monitor work load to determine if you need to spin up more to meet delivery requirements.
You might also want to look into something to implement the network IO infrastructure for you; I've had really good performance results with libevent but then I've only had to deal with very high performance/reliability networking systems.

Multiple threads for multiple ports?

I have a program (process) which needs to listen on 3 ports... Two are TCP and the other UDP.
The two TCP ports are going to be receiving large amounts of data every so often (could be as little as every 5 minutes or as often as every 20 seconds). The third (UDP) port is receiving constant data. Now, does it make sense to have these listening on different threads?
For instance, when I receive a large amount of data from one of the TCP ports, I don't want my UDP stream interrupted... are these common concerns for network programming?
I'll be using the Boost library on Windows if that has any bearing.
I'm just looking for some thoughts/ideas/guidance on this issue and how to manage multiple connections.
In general, avoid threads unless necessary. On a modern machine you will get better performance by using a single thread and using I/O readiness/completion features. On Windows this is IO Completion Ports, on Mac OS X and FreeBSD: kqueue(2), on Solaris: Event ports, on Linux epoll, on VMS QIO. Etc.
In boost, this is abstracted by boost::asio.
The place where threads would be useful is where you must do significant processing or perform a blocking operating system call that would add unacceptable latency to the rest of the networking processing.
Using threads, one per receiving connection, will help keep the throughput high and prevent one port's data from blocking the processing of another port's.
This is a good idea, especially since you're talking about only having 3 connections. Spawning three threads to handle the communication will make your application much easier to maintain and keep performant.
First,
...are these common concerns for network programming?
Yes, threading issues are very common concerns in network programming.
Second, your design idea of using three threads for listening on three different ports would work, allowing you to listen on all three ports simultaneously. As pointed out in the comments, this isn't the only way to do it.
Last, one design that's common in network programming is to have one thread listen for a connection, then spawn a new helper thread to handle processing the connection. The original listener thread just passes the socket connection off to the helper, then goes right back to listening for new connections.
Multiple threads do not scale. Operating system threads are far too heavy weight at the scale of thousands or tens of thousands of connections. Granted, you only have 3, so it's no big deal. In fact, I might even recommend it in your case in the name of simplicity if you are certain your application will not need to scale.
But at scale, you'd want to use select()/poll()/epoll()/libevent/etc. On modern Linux systems, epoll() is by far the most robust and is blazingly fast. One thread polls for socket readiness on all sockets simultaneously, and then sockets that signal as ready are either handled immediately by the single thread, or more often than not handed off to a thread pool. The thread pool has a limited number of threads (usually some ratio of the number of compute cores available on the local machine), wherein a free thread is grabbed whenever a socket is ready.
Again, you can use a thread per connection, but if you're interested in learning the proper way to build scalable network systems, don't. Building a select()/poll() style server is a fantastic learning experience.
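For a feel of the readiness style, here is a minimal sketch using poll(), the portable sibling of epoll(): a single call watches several sockets and reports which ones have data. The names wait_for_readable and demo_poll are made up, and socketpair() stands in for real client connections. A real server would loop, read from the ready sockets, and hand the work to a thread pool.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Blocks for up to `timeout_ms` and returns the indexes (into `sockets`)
// of the descriptors that have data ready to read.
std::vector<std::size_t> wait_for_readable(const std::vector<int>& sockets, int timeout_ms) {
    std::vector<pollfd> watched;
    for (int fd : sockets) {
        pollfd entry{};
        entry.fd = fd;
        entry.events = POLLIN;           // interested in "readable"
        watched.push_back(entry);
    }

    std::vector<std::size_t> ready;
    if (poll(watched.data(), watched.size(), timeout_ms) <= 0)
        return ready;                    // timeout or error: nothing ready

    for (std::size_t i = 0; i < watched.size(); ++i)
        if (watched[i].revents & POLLIN) ready.push_back(i);
    return ready;
}

// Demo: two connections, only the second one has sent anything.
std::vector<std::size_t> demo_poll() {
    int quiet[2];
    int busy[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, quiet);
    assert(rc == 0);
    rc = socketpair(AF_UNIX, SOCK_STREAM, 0, busy);
    assert(rc == 0);
    (void)rc;

    ssize_t written = write(busy[1], "hi", 2);
    assert(written == 2);
    (void)written;

    std::vector<std::size_t> ready = wait_for_readable({quiet[0], busy[0]}, 1000);

    close(quiet[0]);
    close(quiet[1]);
    close(busy[0]);
    close(busy[1]);
    return ready;
}
```

One thread can watch thousands of descriptors this way; epoll() offers the same idea with better scaling when only a few of many sockets are active at once.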
For reference, see the C10K problem:
http://www.kegel.com/c10k.html