Regarding the answer on: How game servers with Boost:Asio work asynchronously?
What if I have a server that does calculations and at the same time sends/receives packets from clients?
I mean, if I were coding an HTTP server, the example in the answer would suffice, since all the data sent is a function of the data received.
Assume my program calculates values and needs to update clients according to their needs (some may want an update frequency of 1 Hz, others 10 Hz, etc.).
This kind of structure would be very helpful to me:
while(1){
    pollNetworking(); //<- my function
    value1 += 5; value2 = random();
}
In my pollNetworking function I was thinking of calling something like acceptor.accept(*socket,10); where 10 is the timeout in milliseconds but since there is no timeout parameter I don't know how to structure this.
Scalability is not the biggest issue. Can I spawn a thread per socket, an extra thread for accepting, and another one for calculations? Will this be easy to implement? I want this to be as stable as possible; then comes speed, then scalability. And when it comes to multi-threading, I don't trust myself to code and debug it cleanly yet.
Edit: I learned that I can use io_service::poll, which only runs handlers that are ready, without blocking. So it behaves like a synchronous call with a zero timeout, exactly what I needed.
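A minimal sketch of how the per-client update rates (1 Hz, 10 Hz, etc.) could fit into such a polling loop, using only the standard library. ClientSchedule and send_due_updates are hypothetical names introduced for illustration; the real send, and the non-blocking network poll, would go where the counter is incremented.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-client record: how often the client wants updates
// and when the next one is due.
struct ClientSchedule {
    std::chrono::milliseconds interval;
    Clock::time_point next_due;
    int updates_sent;
};

// Called once per loop iteration. Counts an update for every client whose
// deadline has passed and returns how many updates were issued.
std::size_t send_due_updates(std::vector<ClientSchedule>& clients,
                             Clock::time_point now) {
    std::size_t sent = 0;
    for (auto& c : clients) {
        if (now >= c.next_due) {
            ++c.updates_sent;          // placeholder for the real send
            c.next_due += c.interval;  // advance from the deadline, not from now
            ++sent;
        }
    }
    return sent;
}
```

The main loop would call send_due_updates(clients, Clock::now()) right after its network poll and before the calculations. Advancing next_due from the previous deadline rather than from the current time keeps the average rate steady even when one iteration runs late.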
The server can do calculations at the same time as data is being sent to and received from the client. However, the buffers and socket will likely need to be protected from concurrent access.
For most Boost.Asio operations, portable timeout functionality is only possible on asynchronous actions. This requires issuing an async operation on an entity, setting a timer, then waiting. For an example of canceling async_read with a timeout, see this question.
The simplest, though less scalable, approach is to designate a thread per responsibility (thread per socket, accepting, and calculations). Synchronization will likely need to occur, such as protecting calculation results. For example, if value1 and value2 are only meaningful in the same iteration, then socket threads need to guarantee that the values are written together without the calculation thread changing the values mid-write. Various synchronization constructs, such as those provided by Boost.Thread, can be used to accomplish this. Also, it may be easier to implement and debug by minimizing the number of asynchronous calls being used.
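As a rough illustration of keeping value1 and value2 consistent, here is a sketch that uses std::mutex from the standard library instead of Boost.Thread (the idea is the same). SharedValues, Values and snapshots_stay_consistent are made-up names for this example only.

```cpp
#include <mutex>
#include <thread>

// Pair of values that only make sense when read from the same iteration.
struct Values {
    long value1;
    long value2;
};

class SharedValues {
public:
    SharedValues() : values_() {}  // both fields start at zero

    // Called by the calculation thread: both fields change under one lock.
    void update(long v1, long v2) {
        std::lock_guard<std::mutex> lock(mutex_);
        values_.value1 = v1;
        values_.value2 = v2;
    }

    // Called by a socket thread: returns a consistent copy to send.
    Values snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return values_;
    }

private:
    mutable std::mutex mutex_;
    Values values_;
};

// Runs one writer and one reader concurrently and reports whether the
// reader only ever saw pairs that belong to the same iteration.
bool snapshots_stay_consistent(int iterations) {
    SharedValues shared;
    std::thread writer([&shared, iterations] {
        for (long i = 1; i <= iterations; ++i) shared.update(i, 2 * i);
    });
    bool consistent = true;
    for (int i = 0; i < iterations; ++i) {
        Values v = shared.snapshot();
        if (v.value2 != 2 * v.value1) consistent = false;
    }
    writer.join();
    return consistent;
}
```

The socket thread copies the pair while holding the lock and then sends from the copy, so the lock is never held across a potentially slow network write.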
For a much more scalable approach, most of the program will be written as a series of handlers invoked from asynchronous operations. This allows the program to take advantage of threads and thread pools much more easily. However, it can scatter program logic across numerous functions and can quickly become difficult to follow. Often, programs written with asynchronous actions in mind will perform synchronization with boost::asio::strand and manage object lifetimes through boost::shared_ptr.
The ease of implementation will depend on experience. Keep in mind that network programming, concurrency, and asynchronous operations are innately difficult. There is rarely a solution that is both simple and complete.
You can still have asynchronous accept and receive, but send to the clients synchronously whenever you need to send to them.
If you can use separate threads for each connected client (I'm guessing you won't be expecting hundreds or thousands of connections) then you can use one thread per connected client for both calculations and sending, while keeping the receiving asynchronous.
Related
I am following this code of a C++ HTTP server. One of the requirements is concurrency. That seems to be taken care of by the following chunk of code:
if(true) {
    if(pthread_create(&thread, 0, handle_request, pcliefd) < 0) {
        perror("pthread_create()");
    }
} else {
    handle_request(pcliefd);
}
I then came across simpler code in this article. pthread is not used there. The response is handled by a write nested inside while(1). I suppose this simpler code does not meet the concurrency requirement? Anyway, what is the point of using threads to handle concurrency if the response is so simple? Is there something bigger behind this requirement?
The goal of your first linked question was to demonstrate a minimum of concurrency. Not a useful amount, just the bare minimum. Your second link doesn't even have that, as you correctly assumed.
Real web servers will be more complex. For starters, you don't want too much concurrency; there are only a limited number of CPU cores in your computer. See std::thread::hardware_concurrency.
Anyways, what is the point of using thread to handle concurrency if the response is so simple?
This is actually a good question. The problem you face when you want to handle a large number of clients is that the read() and write() system calls are usually blocking. That means they block the current thread for as long as it takes to complete the requested operation.
Say you have two clients that send a request to your single-threaded, non-concurrent server. Client A belongs to some lonely guy in a mountain hut with a really slow internet connection. Your accept() call returns and your program calls the handler routine for client A. Now, while the bits slowly trickle through the mountain cable and your handler routine waits for the request to be transmitted, a second client B connects to your server. This one belongs to a businessman on his high-speed office connection.
The problem here is that even if your response is so simple, the high-speed client still has to wait until the handler routine for client A returns before the server can process the next request. One slow client can slow down all the other clients, which is obviously not what you want.
You can solve that problem using two approaches:
1. Create a new thread for each client (that is the attempt in your code). That way, if a slow client blocks its handling routine for a long time, the other clients are still able to proceed with their requests. The problem here is that a large number of clients creates a large number of threads. Context switching between thousands of threads can be a massive performance issue. So for a small number of concurrent clients this is fine, but for large-scale, high-performance servers we need something better.
2. Use a non-blocking API of the operating system. How exactly that works differs between operating systems, and even on a single OS there may be several such APIs. Usually you want to use a platform-independent library if you need this type of concurrency support. An excellent library here is Boost.Asio.
The two approaches can be mixed. For the best performance you would want as many threads as you have processor cores. Each thread handles requests concurrently using an asynchronous (non-blocking) API. This is usually done with a worker pool and a task queue.
I am working on designing a websocket server which receives a message and saves it to an embedded database. For reading the messages I am using boost asio. To save the messages to the embedded database I see a few options in front of me:
Save the messages synchronously as soon as I receive them over the same thread.
Save the messages asynchronously on a separate thread.
I am pretty sure the second answer is what I want. However, I am not sure how to pass messages from the socket thread to the IO thread. I see the following options:
Use one io service per thread and use the post function to communicate between threads. Here I have to worry about lock contention. Should I?
Use Linux domain sockets to pass messages between threads. No lock contention as far as I understand. Here I can probably use BOOST_ASIO_DISABLE_THREADS macro to get some performance boost.
Also, I believe it would help to have multiple IO threads which would receive messages in a round robin fashion to save to the embedded database.
Which architecture would be the most performant? Are there any other alternatives from the ones I mentioned?
A few things to note:
The messages are exactly 8 bytes in length.
Cannot use an external database. The database must be embedded in the running process.
I am thinking about using RocksDB as the embedded database.
I don't think you want to use a unix socket, which is always going to require a system call and pass data through the kernel. That is generally more suitable as an inter-process mechanism than an inter-thread mechanism.
Unless your database API requires that all calls be made from the same thread (which I doubt) you don't have to use a separate boost::asio::io_service for it. I would instead create an io_service::strand on your existing io_service instance and use the strand::dispatch() member function (instead of io_service::post()) for any blocking database tasks. Using a strand in this manner guarantees that at most one thread may be blocked accessing the database, leaving all the other threads in your io_service instance available to service non-database tasks.
Why might this be better than using a separate io_service instance? One advantage is that having a single instance with one set of threads is slightly simpler to code and maintain. Another minor advantage is that using strand::dispatch() will execute in the current thread if it can (i.e. if no task is already running in the strand), which may avoid a context switch.
For the ultimate optimization I would agree that using a specialized queue whose enqueue operation cannot make a system call could be fastest. But given that you have network i/o by producers and disk i/o by consumers, I don't see how the implementation of the queue is going to be your bottleneck.
After benchmarking/profiling I found the Facebook folly implementation of MPMC Queue to be the fastest by at least a 50% margin. If I use the non-blocking write method, then the socket thread has almost no overhead and the IO threads remain busy. The number of system calls is also much lower than with other queue implementations.
The SPSC queue with a condition variable in Boost is slower. I am not sure why that is. It might have something to do with the adaptive spin that the folly queue uses.
Also, message passing (UDP domain sockets in this case) turned out to be orders of magnitude slower, especially for larger messages. This might have something to do with the data being copied twice.
You probably only need one io_service -- you can create additional threads which will process events occurring within the io_service by providing boost::asio::io_service::run as the thread function. This should scale well for receiving 8-byte messages from clients over the network socket.
For storing the messages in the database, it depends on the database & interface. If it's multi-threaded, then you might as well just send each message to the DB from the thread that received it. Otherwise, I'd probably set up a boost::lockfree::queue where a single reader thread pulls items off and sends them to the database, and the io_service threads append new messages to the queue when they arrive.
Is that the most efficient approach? I dunno. It's definitely simple, and gives you a baseline that you can profile if it's not fast enough for your situation. But I would recommend against designing something more complicated at first: you don't know whether you'll need it at all, and unless you know a lot about your system, it's practically impossible to say whether a complicated approach would perform any better than the simple one.
void Consumer( lockfree::queue<uint64_t> &message_queue ) {
    // Connect to database...
    while (!Finished) {
        message_queue.consume_all( add_to_database ); // add_to_database is a Functor that takes a message
        cond_var.wait_for( ... ); // Use a timed wait to avoid missing a signal. It's OK to consume_all() even if there's nothing in the queue.
    }
}
void Producer( lockfree::queue<uint64_t> &message_queue ) {
    while (!Finished) {
        uint64_t m = receive_from_network( );
        message_queue.push( m );
        cond_var.notify_all( );
    }
}
Assuming that the constraint of using C++11 is not too hard in your situation, I would try using std::async to make an asynchronous call to the embedded DB.
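A rough sketch of what that could look like. FakeDb is only a stand-in for the embedded database (it is not RocksDB's API), and save_async is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <mutex>
#include <vector>

// Stand-in for the embedded database (hypothetical; not RocksDB's API).
class FakeDb {
public:
    bool put(std::uint64_t message) {
        std::lock_guard<std::mutex> lock(mutex_);
        rows_.push_back(message);
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return rows_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::uint64_t> rows_;
};

// Starts the write on another thread and returns immediately.
// The caller must keep the db object alive until the future is ready.
std::future<bool> save_async(FakeDb& db, std::uint64_t message) {
    return std::async(std::launch::async,
                      [&db, message] { return db.put(message); });
}
```

One thing to watch: the destructor of a future returned by std::async blocks until the task finishes, so the socket thread has to store the future somewhere (and check it later) rather than discard it, otherwise the call is effectively synchronous.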
I am building a system that sends and receives UDP packets to multiple pieces of remote hardware.
A function mySend passes new information to send to a third-party API that I must use to construct the actual UDP datagram. The API locks a mutex during its work constructing and sending the datagram.
A function myRecv runs in a worker thread, repeatedly asking the third-party API to poll for new data. The API invokes a UDP-receive function which runs select and recvfrom to grab any responses from the remote hardware.
The thread that listens and handles incoming packets is problematic at the moment due to the design of the API I'm using to decode those packets, which locks its own mutex around the call to the UDP-receive function. But this function performs a blocking select.
The consequence is that the mutex is almost always locked by the receive thread and, in fact, the contention is so bad that mySend is practically never able to obtain the lock. The result is that the base thread is effectively deadlocked.
To fix this, I'm trying to justify making the listen socket non-blocking and performing a usleep between select calls where no data was available.
Now, if my blocking select had a 3-second timeout, that's not the same as performing a non-blocking select every 3 seconds (in the worst case) because of the introduction of latency in looking for and consequently handling incoming packets. So the usleep period has to be a lot lower, say 300-500ms.
My concern is mostly in the additional system calls — this is a lot more calls to select, and new calls to usleep. At times I will expect next to no incoming data for tens of seconds or even minutes, but there will also likely be periods during which I might expect to receive perhaps 40KB over a few seconds.
My first instinct, if this were all my own software, would be to tighten up the use of mutexes such that no locking was in place around select at all, and then there'd be no problem. But I'd like to avoid hacking about in the 3rd-party API if I don't have to.
Simple time-based profiling is not really enough at this stage because this mechanism needs to scale really well, and I don't have the means to test at scale right now. Consequently I'm trying to gather some anecdotal evidence in order to steer my decision-making.
Is moving to a non-blocking socket the right approach?
Or would I be better off hacking up the third-party API (which I'd rather not do) to tighten their mutex usage?
I, my team and the developers of the 3rd party library have all come to the conclusion that the hack is suitable enough for deployment, and outweighs the questions posed and disadvantages associated with my potential alternative workarounds.
The real solution is, of course, to push a proper design fix into the 3rd party library; this is a way off as it would be fairly extensive and nobody really cares enough, but it does give us the answer to this question.
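For illustration only, the "no locking around select" idea from the question might look roughly like this on a POSIX system. This is not the third-party API's code: wait_readable, receive_once and api_mutex are hypothetical names, and read() stands in for recvfrom() so the sketch works on any file descriptor.

```cpp
#include <cstddef>
#include <mutex>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Waits up to timeout_ms for fd to become readable. No mutex is held here,
// so a sender is free to take the lock while this call is blocked.
bool wait_readable(int fd, int timeout_ms) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &readfds, nullptr, nullptr, &tv) > 0;
}

// Hypothetical receive step: block in select() without the lock, then hold
// the lock only for the short read that follows. Returns 0 on timeout.
ssize_t receive_once(int fd, std::mutex& api_mutex, char* buf,
                     std::size_t len, int timeout_ms) {
    if (!wait_readable(fd, timeout_ms)) return 0;
    std::lock_guard<std::mutex> lock(api_mutex);
    return read(fd, buf, len);
}
```

With this shape the blocking select keeps its long timeout, so no usleep or extra system calls are needed, and the mutex is held only while data is actually being read.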
I have been trying to understand the logic in boost's http server 3 example. The request in this example is read inside connection.cpp, in the start() method, which calls:
socket_.async_read_some(boost::asio::buffer(buffer_),
    strand_.wrap(
        boost::bind(&connection::handle_read, shared_from_this(),
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred)));
Note that the async_read_some method is documented to return immediately. Then inside the read handler (connection::handle_read()), we may again call async_read_some if parse returns boost::indeterminate. What benefit does this provide over socket_.read_some(buffer), given that we already know we are working in a separate thread? The reason I ask is that I want to change the message parsing a bit to call read_some on demand, but the method I have in mind won't work with an async read.
Also, a related question: is there any difference between
async_read_some()
and
boost::thread th([](){ ret = read_some(); handle_read(ret) });?
Boost.Asio's HTTP Server 3 example is coded in a way that it remains agnostic to the size of the thread pool. As such, there is no guarantee that work will be done in separate threads. Nevertheless, the benefit of being agnostic is that it scales better with more connections. For example, consider the C10K problem, which examines 10000 clients simultaneously connected. A synchronous solution may run into various performance issues or resource limitations with 10000 clients. Additionally, the asynchronous nature helps insulate the program from behavior changes in the network. For instance, consider a synchronous program that has 3 clients and 2 threads, but 2 of the clients have high latency due to an increase in noise on the network. The 3rd client could be inadvertently affected if both of the threads are blocked waiting for data from the other clients.
If there is a low and finite number of connections, with each connection serviced by a thread, then the performance difference between a synchronous and asynchronous server may be minimal. When possible, it is often advised to avoid mixing asynchronous and synchronous programming, as it can turn a complex solution into a complicated one. Furthermore, most synchronous algorithms can be written asynchronously.
There are two major differences between an asynchronous operation and a synchronous operation (even those running within a dedicated thread):
Thread safety. As noted in the documentation:
In general, it is safe to make concurrent use of distinct objects, but unsafe to make concurrent use of a single object.
Therefore, asynchronous and synchronous operations cannot safely be initiated on an object while a synchronous operation on that same object is in progress, even if the operation is invoked within its own thread. The impact may be minimal with a half-duplex protocol, but it should be considered with full-duplex protocols.
Ability to cancel an operation. As noted in this answer, synchronous operations cannot be cancelled through the cancel() member functions Boost.Asio provides. Instead, the application may need to use lower level mechanics, such as signals.
I am implementing a custom server that needs to maintain a very large number (100K or more) of long-lived connections. The server simply passes messages between sockets and doesn't do any serious data processing. Messages are small, but many of them are received/sent every second. Reducing latency is one of the goals. I realize that using multiple cores won't improve performance, and therefore I decided to run the server in a single thread by calling the run_one or poll methods of the io_service object. In any case, a multi-threaded server would be much harder to implement.
What are the possible bottlenecks? Syscalls, bandwidth, completion queue / event demultiplexing? I suspect that dispatching handlers may require locking (that is done internally by asio library). Is it possible to disable even queue locking (or any other locking) in boost.asio?
EDIT: related question. Does syscall performance improve with multiple threads? My feeling is that because syscalls are atomic/synchronized by the kernel adding more threads won't improve speed.
You might want to read my question from a few years ago, I asked it when first investigating the scalability of Boost.Asio while developing the system software for the Blue Gene/Q supercomputer.
Scaling to 100k or more connections should not be a problem, though you will need to be aware of the obvious resource limitations such as the maximum number of open file descriptors. If you haven't read the seminal C10K paper, I suggest reading it.
After you have implemented your application using a single thread and a single io_service, I suggest investigating a pool of threads invoking io_service::run(), and only then investigate pinning an io_service to a specific thread and/or cpu. There are multiple examples included in the Asio documentation for all three of these designs, and several questions on SO with more information. Be aware that as you introduce multiple threads invoking io_service::run() you may need to implement strands to ensure the handlers have exclusive access to shared data structures.
Using boost::asio you can write a single-threaded or a multi-threaded server at approximately the same development cost. You can write a single-threaded version first, then convert it to a multithreaded one if needed.
Typically, the only bottleneck in boost::asio is that the epoll/kqueue reactor runs under a mutex, so only one thread calls epoll at a time. This can decrease performance in a multithreaded server that handles a very large number of very small packets. But IMO it should still be faster than a plain single-threaded server.
Now about your task. If you just want to pass messages between connections, I think it should be a multithreaded server. The problem is syscalls (recv/send etc.). A single instruction is very cheap for the CPU, but a syscall is not a "light" operation (everything is relative, but it is heavy relative to the other work in your task). So with a single thread you will get a large syscall overhead, which is why I recommend a multithreaded scheme.
Also, you can split the io_service and use the "io_service per thread" idiom. I think this should give the best performance, but it has a drawback: if one io_service's queue grows too large, the other threads cannot help it, so some connections may slow down. On the other hand, with a single io_service, a queue overrun can lead to heavy locking overhead. All you can do is implement both variants and measure bandwidth/latency. It should not be too difficult to implement both.