I am following this code for a C++ HTTP server. One of the requirements is concurrency. That seems to be taken care of by the following chunk of code:
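if (true) {
    // pthread_create() returns an error number on failure and does not set errno,
    // so a "< 0" test never fires and perror() would print an unrelated message.
    // (fprintf needs <stdio.h>, strerror needs <string.h>.)
    int err = pthread_create(&thread, 0, handle_request, pcliefd);
    if (err != 0) {
        fprintf(stderr, "pthread_create(): %s\n", strerror(err));
    }
} else {
    handle_request(pcliefd);
}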
I then came across some simpler code in this article. pthread is not used there. The response is handled by a write() nested inside while(1). I suppose this simpler code does not meet the concurrency requirement? Anyways, what is the point of using thread to handle concurrency if the response is so simple? Is there something bigger behind this requirement?
The goal of your first linked question was to demonstrate a minimum of concurrency. Not a useful amount, just the bare minimum. Your second link doesn't even have that, as you correctly assumed.
Real web servers will be more complex. For starters, you don't want too much concurrency; there are only a limited number of CPU cores in your computer. See std::thread::hardware_concurrency.
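As a rough illustration of sizing concurrency from the core count, here is a minimal sketch. The helper name pick_worker_count and its fallback parameter are made up for this example; the only library call is std::thread::hardware_concurrency, which is allowed to return 0 when the value is not computable.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: choose a worker count from the hardware hint.
// hardware_concurrency() is only a hint and may return 0 when the value
// cannot be determined, so fall back to a caller-supplied default
// (the default of 2 is an assumption, not a recommendation).
unsigned pick_worker_count(unsigned fallback = 2) {
    unsigned hint = std::thread::hardware_concurrency();
    return hint != 0 ? hint : std::max(1u, fallback);
}
```

A server would typically call this once at startup and create that many worker threads, rather than one thread per client.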
Anyways, what is the point of using thread to handle concurrency if the response is so simple?
This is actually a good question. The problem you face when you want to handle a large number of clients is that the read() and write() system calls are usually blocking. That means they block your current thread for as long as they take to complete the requested operation.
Say you have two clients that send a request to your single-threaded, non-concurrent server. Client A belongs to some lonely guy in a mountain hut with a really slow internet connection. Your accept() call returns and your program calls the handler routine for client A. Now, while the bits slowly trickle through the mountain cable and your handler routine waits for the request to be transmitted, a second client B connects to your server. This one belongs to a businessman with high-speed office internet access.
The problem here is that even if your response is so simple, the high-speed client still has to wait until your handler routine returns before the server can process the next request. One slow client can slow down all the other clients, which is obviously not what you want.
You can solve that problem using two approaches:
1. (That is the attempt in your code.) You create a new thread for each client. That way, if a slow client is blocking the handling routine for a long time, the other clients are still able to proceed with their requests. The problem here is that a large number of clients creates a large number of threads. Context switching between thousands of threads can be a massive performance issue. So for a small number of concurrent clients this is fine, but for large-scale, high-performance servers we need something better.
2. You use a non-blocking API of the operating system. How exactly that works differs between operating systems, and even on a single OS there may be several such APIs. Usually you want to use a platform-independent library if you need this type of concurrency support. An excellent library here is Boost Asio.
The two approaches can be mixed. For the best performance you would want to have as many threads as you have processor cores. Each thread handles requests concurrently using an asynchronous (non-blocking) API. This is usually done with a worker pool and a task queue.
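The "worker pool and a task queue" idea can be sketched with the standard library alone. This is a minimal illustration, not how Boost Asio is implemented: the class name WorkerPool and its submit() method are invented for this example, and it uses a mutex and condition variable instead of any operating system non-blocking API.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size worker pool: N threads pull tasks from one shared queue.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Lets the workers finish the queued tasks, then joins every worker.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& t : threads_) t.join();
    }

    WorkerPool(const WorkerPool&) = delete;
    WorkerPool& operator=(const WorkerPool&) = delete;

    // Enqueue one task, e.g. "handle the request on this connection".
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

With a pool like this, the accept loop would submit one task per connection instead of calling pthread_create per client, so the number of threads stays fixed no matter how many clients connect.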
Related
I'm having trouble managing the work .post()'ed to Boost.Asio's io_context, having multiple questions about it (newbie warning).
Background: I'm writing a library that connects to a large number of different hosts for a short period each (connect, send data, receive answer, close), and I figured I would use Boost.Asio. The documentation is scarce (too DRY?).
My current approach is this (assuming a quad-core machine): two physical cores run CPU-bound sync operations and post() additional work items to io_context. Two other threads are .run()ing and performing completion handlers.
1- The work scheduler
As per this amazing answer,
Boost.Asio may start some of the work as soon as it has been told about it, and other times it may wait to do the work at a later point in time.
When does boost.asio do what? On what basis is the queued work later processed?
2- Multiple Producers/ Multiple Consumers
As per This article,
At its core, Boost Asio provides a task execution framework that you can use to perform operations of any kind. You create your tasks as function objects and post them to a task queue maintained by Boost Asio. You enlist one or more threads to pick these tasks (function objects) and invoke them. The threads keep picking up tasks, one after the other till the task queues are empty at which point the threads do not block but exit.
I am failing to find a way to put a cap on the length of this task queue. This answer gives a couple of solutions, but they both involve locking, something I'd like to avoid as much as possible.
3- Are strands really necessary? How do I "disable them"
As detailed in this answer, Boost uses an implicit strand per connection. When making potentially millions of connections, the memory savings from "bypassing" strands make sense to me. As the requests I make are independent (a different host for each request), the operations I make within a single connection are already serialized (callback chain), so I have no overlapping reads and writes, and no synchronization is expected from Boost.Asio. Does it make sense for me to try and bypass strands? If so, how?
4- Scaling design approach (A bit vague because I have no clue)
As stated in my background section, I'm running two io_contexts on two physical cores, each with two threads one for writing and one for reading. My goal here is to spew packets as fast as I can, and I have already
Compiled asio with BoringSSL (OpenSSL is a serious bottleneck)
Wrote my own c-ares resolver service to avoid async-ish DNS queries running in a thread loop.
But it still happens that my network driver starts timing out when multiple connections are opened. So how do I dynamically adjust Boost.Asio's throughput so that the network adapter can cope with it?
My question(s) are most likely ill-informed, as I'm no expert in network programming, and I know this is a complex problem. I'd appreciate it if someone left pointers for me to look at before closing the question or making it "dead".
Thank you.
I am attempting to rewrite my current project to include more features and stability, and need some help designing it. Here is the gist of it (for Linux):
TCP_SERVER receives connection (auth packet)
TCP_SERVER starts a new (thread/fork) to handle the new client
TCP_SERVER will be receiving many packets from the client, which will be added to a circular buffer
A separate thread will be created for that client to process those packets and build a list of objects
Another thread should be created to send parts of the list of objects to another client
The reason to separate all the processing into threads is that the server will be getting many packets and the processing won't be able to keep up (it needs to be quick, as it's time sensitive; I'm not sure if TCP will drop packets if the internal buffer gets too large?). Another thread sends to the other client to keep the processing as fast as possible.
So for each new connection, 3 threads should be created. 1 to receive packets, 1 to process them, and 1 to send the processed data to another client (which is technically the same person/ip just on a different device)
And I need help designing this: how to structure it, what to use (forks/threads), and which libraries to use.
Trying to do this yourself is going to cause you a world of pain. Focus on your actual application, and leverage an existing socket handling framework. For example, you said:
for each new connection, 3 threads should be created
That statement says the following:
1. You haven't done this before, at scale, and haven't realized the impact all these threads will have.
2. You've never benchmarked thread creation or synchronous operations.
3. The number of things that can go wrong with this approach is pretty overwhelming.
Give some serious thought to using an existing library that does most of this for you. Getting the scaffolding right around this can literally take years, and you're better off focusing on your code rather than all the random plumbing.
The Boost C++ libraries seem to have a nice Async C++ socket handling infrastructure. Combine this with some of the existing C++ thread pools and you could likely have a highly performant solution up fairly quickly.
I would also question your use of C++ for this. Java and C# both do highly scalable socket servers pretty well, and some of the higher-level language tooling (Spring, Guava, etc.) can be very, very valuable. If you ever want to secure this, via TLS or another mechanism, you'll also probably find this much easier in Java or C# than in C++.
Some of the major things you'll care about:
1. True Async I/O will be a huge perf and scalability win. Try really hard to do this. The boost asio library looks pretty nice.
2. Focus on your features and stability, rather than building a new socket handling platform.
3. Threads are expensive, avoid creating them. Thread pools are your friend.
You plan to create one-or-more threads for every connection your server handles. Threads are not free, they come with a memory and CPU overhead, and when you have many active threads you also begin to have resource contention.
What usage pattern do you anticipate? Do you expect that when you have 8 connections, all 8 network threads will be consuming 100% of a cpu core pushing/pulling packets? Or do you expect them to have a relatively low turn-around?
As you add more threads, you will begin to have to spend more time competing for resources in things like mutexes etc.
A better pattern is to have one or more threads for network I/O. Most operating systems have mechanisms for saying "tell me when one or more of these network connections has I/O", which is more efficient than having lots of individual threads each doing the same thing for just one connection.
Then for actual processing, spin up a pool of worker threads to do actual work, allowing you to minimize the competition for resources. You can monitor work load to determine if you need to spin up more to meet delivery requirements.
You might also want to look into something to implement the network IO infrastructure for you; I've had really good performance results with libevent but then I've only had to deal with very high performance/reliability networking systems.
Regarding the answer on: How game servers with Boost:Asio work asynchronously?
What if I have a server which does calculations and at the same time sends/receive packets from clients?
I mean if I was coding a http-server the example on the answer would suffice since all the data sent are functions of the data received.
Assume my program calculates values and needs to update clients according to their needs (some may want update frequency 1 hz, where another one 10 hz etc).
This kind of structure would be very helpful to me:
while(1){
pollNetworking(); //<- my function
value1 += 5; value2 = random();
}
In my pollNetworking function I was thinking of calling something like acceptor.accept(*socket,10); where 10 is the timeout in milliseconds but since there is no timeout parameter I don't know how to structure this.
Scalability is not the biggest issue. Can I spawn a thread per socket, an extra thread for accepting, and another one for calculations? Will this be easy to implement? I want this to be as stable as possible; then comes speed, then comes scalability. And when it comes to multi-threading, I don't trust myself to code and debug it cleanly yet.
Edit: I learned that I can use io_service::poll, which only dispatches ready events without blocking. So it is a synchronous function with 0 timeout, exactly as I needed.
The server can do calculations at the same time as data is being sent and received from the client. However, the buffers and socket will likely need to be protected from concurrent access.
For most Boost.Asio operations, portable timeout functionality is only possible on asynchronous actions. This requires issuing an async operation on an entity, setting a timer, then waiting. For an example of canceling async_read with a timeout, see this question.
The simplest, and less scalable, approach is to designate a thread per responsibility (thread per socket, accepting, and calculations). Synchronization will likely need to occur, such as protecting calculation results. For example, if value1 and value2 are only meaningful in the same iteration, then socket threads need to guarantee that the values are written together without the calculation thread changing the values mid-write. Various synchronization constructs, such as those provided by Boost.Thread, can be used to accomplish this. Also, it may be easier to implement and debug by minimizing the amount of asynchronous calls being used.
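The point about writing value1 and value2 together can be sketched as follows. This is an illustration under assumptions: the names Values and SharedValues are invented here, and std::mutex is used in place of the Boost.Thread constructs mentioned above (the two are similar in spirit).

```cpp
#include <mutex>

// Hypothetical shared state: value1 and value2 only make sense as a pair.
struct Values {
    int value1 = 0;
    int value2 = 0;
};

class SharedValues {
public:
    // Called by the calculation thread: both fields change under one lock,
    // so no reader can observe value1 from one iteration and value2 from another.
    void update(int v1, int v2) {
        std::lock_guard<std::mutex> lock(mutex_);
        current_.value1 = v1;
        current_.value2 = v2;
    }

    // Called by socket threads: copy the pair under the same lock, then
    // serialize and send the copy after the lock has been released.
    Values snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    Values current_;
};
```

Taking a copy keeps the lock hold time short: a slow socket write never blocks the calculation thread, because the write happens on the snapshot rather than on the shared state.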
For a much more scalable approach, most of the program will be written as a series of handlers invoked from asynchronous operations. This allows the program to take advantage of threads and thread pools much more easily. However, it can scatter program logic across numerous functions, and can quickly become difficult to follow. Oftentimes, programs written with asynchronous actions in mind will perform synchronization with boost::asio::strand, and manage object lifetimes through boost::shared_ptr.
The ease of implementation will depend on experience. Keep in mind that network programming, concurrency, and asynchronous operations are innately difficult. There is rarely a solution that is both simple and complete.
You can still have asynchronous accept and receive, but send to the clients synchronous whenever you need to send to them.
If you can use separate threads for each connected client (I'm guessing you won't be expecting hundreds or thousands of connections) then you can use one thread per connected client for both calculations and sending, while keeping the receiving asynchronous.
I'm designing a C++ client application that listens on multiple ports for streams of short messages. After reading up on ACE, POCO, boost::asio and all the Proactor-like design patterns, I am about to start with boost::asio.
One thing I notice is a constant theme of using asynchronous socket I/O, yet I have not read a good description of the problem that async I/O solves. Are all these design patterns based on the assumption of HTTP web server design?
Since web servers are the most common application for complex, latency-sensitive, concurrent socket programming, I'm starting to wonder if most of these patterns/idioms are geared toward this one application.
My application will listen to a handful of sockets for short and frequent messages. A separate thread will need to combine all messages for processing. One thing that I am looking at design patterns for is separating the connection management from the data processing. I want the connections to try to reconnect after a disconnect and have the processing thread continue as if nothing happened. What design pattern is recommended here?
I don't see how async I/O will improve performance in my case.
You're on the right track. It's smart to ask "why" especially with all the hype around asynchronous and event driven. There are applications other than the web. Consider message queues and financial transactions like high frequency trading. Basically any time that waiting costs money or loses an opportunity to serve a customer is a candidate for async. The web is a big example because the network is so much faster than the database. As always, ask "does this make sense" for your app. Async adds a lot of complexity if you're not benefiting from it.
Your short, rapid messages may actually benefit a lot from async if the mean time between messages is comparable to the time required to process each message, especially if that processing includes persistence. But you don't have to rush into async. Instrument your blocking code and see whether you really have a bottleneck.
I hope this helps.
Using the blocking calls pattern would entail:
1. Listening on a socket vector of size N
2. When a message arrives, you wake up with a start in time K, find and start processing the message, employing a time T (it does not matter if the processing is offloaded to another thread: in this case T becomes your offloading time)
3. You finish examining the vector and GOTO 1
So you might say that if M messages arrive, and another message arrives during the K+M*T dispatching time, the M+1-th message will find itself waiting K+M*T time. Is this acceptable for your expected values of K (constant), M (function of traffic) and T (function of resources and system load)?
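To make the K + M*T estimate concrete, here is a tiny helper that evaluates it. The function name queued_wait and the sample numbers in the note below are assumptions chosen only for illustration.

```cpp
// Delay seen by a message that arrives while M earlier messages are still
// being dispatched: K + M*T, following the estimate above.
// The result is in whatever time unit K and T are expressed in.
constexpr double queued_wait(double wakeup_k, unsigned pending_m, double per_message_t) {
    return wakeup_k + pending_m * per_message_t;
}
```

For example, with a hypothetical wake-up cost K of 50 microseconds, T of 20 microseconds per message, and M = 100 pending messages, the next message waits 50 + 100 * 20 = 2050 microseconds. Whether that is acceptable depends on the latency budget of the application.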
Asynchronous processing, well, actually does not exist. There will always be a "synchronous" IO loop somewhere, only it will be so well integrated in the kernel (or even hardware) that it will run 10-100x faster than your own, and therefore will be likely to scale better with large values of M. The delay is still in the form K1+M*T1, but this time K1 and T1 are much lower. Or maybe K1 is a bit higher and T1 is significantly lower: the architecture "scales better" for large values of M.
If your values of M are usually low, then the advantages of asynchronicity are proportionally smaller. In the absurd case when you only have one message in the application lifetime, synchronous or asynchronous makes next to no difference.
Take into account another factor: if the number of messages becomes really large, asynchronicity has its advantages; but if the messages themselves are independent (the changes caused by message A do not influence processing of message B), then you can remain synchronous and scale horizontally, preparing a number Z of "message concentrators" each receiving a fraction M/Z of the total traffic.
If the processing requires performing other calls to other services (cache, persistence, information retrieval, authentication...), increasing the T factor, then you'll be better off turning to multithreaded or even asynchronous. With multithreading you shave T to a fraction of its value (the dispatching time only). Asynchronous in a sense does the same, but shaving even more, and taking care of more programming boilerplate for you.
I am currently writing a bittorrent client. I am getting to the stage in my program where I need to start thinking about whether multiple threads would improve my program and how many I would need.
I assume that I would assign one thread to deal with the trackers because the program may be in contact with several (1-5 roughly) of them at once, but will only need to contact them in an interval assigned by the tracker (around 20 minutes), so won't be very intensive on the program.
The program will be in regular contact with numerous peers to download pieces of files from them. The following is taken from the Bittorrent Specification Wiki:
Implementer's Note: Even 30 peers is plenty, the official client version 3 in fact only actively forms new connections if it has less than 30 peers and will refuse connections if it has 55. This value is important to performance. When a new piece has completed download, HAVE messages (see below) will need to be sent to most active peers. As a result the cost of broadcast traffic grows in direct proportion to the number of peers. Above 25, new peers are highly unlikely to increase download speed. UI designers are strongly advised to make this obscure and hard to change as it is very rare to be useful to do so.
It suggests that I should be in contact with roughly 30 peers. What would be a good thread model to use for my Bittorrent Client? Obviously I don't want to assign a thread to each peer and each tracker, but I will probably need more than just the main thread. What do you suggest?
I don't see a lot of need for multithreading here. Having too many threads also means having a lot of communication between these to make sure everyone is doing the right thing at the right time.
For the networking, keep everything on one thread and just multiplex using nonblocking I/O. On Unix systems this would be a setup with select/poll (or platform-specific extensions such as epoll); on Windows this would be completion ports.
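On a Unix system, the single-threaded multiplexing described above might look roughly like this with poll(). It is a sketch under assumptions: the helper names readable_fds and read_available are invented, the descriptors are expected to be open already, and error handling is reduced to "report nothing".

```cpp
#include <poll.h>
#include <unistd.h>

#include <cstddef>
#include <string>
#include <vector>

// Returns the descriptors in `fds` that have data ready to read, waiting at
// most `timeout_ms` milliseconds. One thread can watch many peers this way.
std::vector<int> readable_fds(const std::vector<int>& fds, int timeout_ms) {
    std::vector<pollfd> watched;
    watched.reserve(fds.size());
    for (int fd : fds) {
        pollfd p{};
        p.fd = fd;
        p.events = POLLIN;  // only interested in "readable" here
        watched.push_back(p);
    }

    std::vector<int> ready;
    int n = poll(watched.data(), static_cast<nfds_t>(watched.size()), timeout_ms);
    if (n <= 0) return ready;  // 0 means timeout, -1 means error

    for (const pollfd& p : watched) {
        if (p.revents & POLLIN) ready.push_back(p.fd);
    }
    return ready;
}

// Reads whatever is currently available on a descriptor reported as ready.
std::string read_available(int fd) {
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);
    return n > 0 ? std::string(buf, static_cast<std::size_t>(n)) : std::string();
}
```

The main loop would call readable_fds() with all peer sockets, handle each ready descriptor in turn, and loop again. No locking is needed because only one thread ever touches the connections.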
You can even add the disk I/O into this, which would make the communication between the threads trivial since there isn't any :-)
If you want to consider threads to be containers for separate components, the disk I/O could go into another thread. You could use blocking I/O in this case, since there isn't a lot of multiplexing anyway.
Likewise, in such a scenario, tracker handling could go into a different thread as well since it's a different component from peer handling. Same for DHT.
You might want to offload the checksum-checking to a separate thread. Not quite sure how complex this gets, but if there's significant CPU use involved then putting it away from the I/O stuff doesn't sound that bad.
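Offloading the check to another thread could be sketched with std::async. The checksum below is a stand-in (32-bit FNV-1a) chosen only to keep the example self-contained; a real client would compute the hash its protocol requires. The function names checksum and verify_async are also invented for this sketch.

```cpp
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Stand-in checksum (32-bit FNV-1a). This is a placeholder assumption,
// not the hash a real BitTorrent client uses to verify pieces.
std::uint32_t checksum(const std::vector<std::uint8_t>& piece) {
    std::uint32_t h = 2166136261u;
    for (std::uint8_t byte : piece) {
        h ^= byte;
        h *= 16777619u;
    }
    return h;
}

// Verify a piece on another thread so the I/O loop is not blocked.
// The piece is moved into the task, so it does not depend on the
// caller's buffer staying alive.
std::future<bool> verify_async(std::vector<std::uint8_t> piece, std::uint32_t expected) {
    return std::async(std::launch::async,
                      [p = std::move(piece), expected] { return checksum(p) == expected; });
}
```

The I/O thread keeps the returned future and checks it later (for instance with wait_for and a zero timeout) instead of blocking on get() right away.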
As you tagged your question [C++], I suggest std::thread from C++11. A nice tutorial (among lots of others) can be found here.
Concerning the number of threads: You can use 30 threads without any problem and have them check whether there is something to do for them and putting them to sleep for a reasonable time between the checks. The operating system will take care of the rest.