I was attempting to understand the Boost Asio implementation and its limitations. As I understand it from here - https://www.boost.org/doc/libs/1_75_0/doc/html/boost_asio/overview/core/basics.html
When you do an async_receive_from call on a socket, the following things happen:
1. The socket forwards the request to the I/O execution context.
2. The I/O execution context signals to the operating system that it should start an asynchronous receive.
3. The operating system indicates that the receive operation has completed by placing the result on a queue, ready to be picked up by the I/O execution context.
4. When using an io_context as the I/O execution context, your program must make a call to io_context::run() (or to one of the similar io_context member functions) in order for the result to be retrieved. A call to io_context::run() blocks while there are unfinished asynchronous operations, so you would typically call it as soon as you have started your first asynchronous operation (a minimal sketch of these steps follows this list).
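For concreteness, here is a minimal sketch of those four steps with async_receive_from (the port number and the handler body are made up for illustration):

```cpp
#include <array>
#include <iostream>
#include <boost/asio.hpp>

using boost::asio::ip::udp;

int main() {
    boost::asio::io_context io;
    udp::socket socket(io, udp::endpoint(udp::v4(), 9999));  // port is made up

    std::array<char, 65536> buffer;
    udp::endpoint sender;

    // Steps 1-2: the socket forwards the request to the io_context, which
    // signals the OS to start the asynchronous receive.
    socket.async_receive_from(
        boost::asio::buffer(buffer), sender,
        [&](const boost::system::error_code& ec, std::size_t n) {
            // Steps 3-4: the OS queued the completed result; io_context::run()
            // picked it up and invoked this handler.
            if (!ec) std::cout << "received " << n << " bytes\n";
        });

    io.run();  // blocks while asynchronous operations are outstanding
}
```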
Assuming I have a very high throughput of data coming in, what I'm trying to understand is:
1. Is there a possibility of data loss in step 2 above, where the I/O execution context signals the OS to perform the async receive operation? Can the OS somehow get overwhelmed by the volume of asynchronous reads?
2. In step 3 above, the OS puts completed reads in a queue. What is the capacity of this queue? Can this queue overflow if, for example, there is a burst of network traffic and all the threads running io_context::run() are occupied, so read data keeps accumulating in the queue? Is this queue bounded or unbounded?
The ASIO code is open-source, but I'm fairly new to C++ and am finding it a little difficult to understand the code. Appreciate any help on these questions. Thanks!
There's no buffering in ASIO whatsoever; ASIO is a thin wrapper around native OS select/epoll/kqueue/IOCP (depending on OS) as well as non-blocking send/recv calls.
Your question can thus be re-phrased as "what happens when I don't call recv fast enough?". As it turns out, that question has already been asked before; see What happens if one doesn't call POSIX's recv “fast enough”?.
Anyway, to answer the specific questions:
1. Is there a possibility of data loss in step 2 above where IO execution context signals OS to perform the async receive operation? Can the OS get somehow overwhelmed with the volume of asynchronous reads?
The OS can't get overwhelmed with async receive calls because you can have at most one active async receive and one active async send per socket, and the number of sockets is limited.
2. ... What is the capacity of this queue? Can this queue overflow if for example, there was a burst of network traffic and all the threads running io_context::run() are occupied, hence read data keeps accumulating in the queue? Is this queue bounded or unbounded?
The queueing characteristics of a TCP stream are determined by the TCP receive buffer and TCP receive window. These are configurable in most modern OSes, and can even be dynamic. The receive buffer is bounded, and if you don't receive fast enough, TCP has built-in mechanisms to signal the sending side to slow down/retransmit (a.k.a. TCP flow control).
Similarly, UDP sockets have a receive buffer. When it fills up, new incoming packets are dropped; see the sketch below for how to enlarge it.
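If you anticipate bursts, one mitigation is to request a larger receive buffer through Asio. A sketch, assuming an existing io_context named io; note the OS may clamp the requested size (e.g. to net.core.rmem_max on Linux):

```cpp
// Enlarge the kernel receive buffer of a UDP socket (the OS may clamp it).
boost::asio::ip::udp::socket socket(
    io, boost::asio::ip::udp::endpoint(boost::asio::ip::udp::v4(), 9999));

socket.set_option(
    boost::asio::socket_base::receive_buffer_size(4 * 1024 * 1024));

// Read back what the OS actually granted.
boost::asio::socket_base::receive_buffer_size granted;
socket.get_option(granted);
// granted.value() now holds the effective buffer size in bytes
```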
Related
How independent is the handling of UDP send and receive on same socket in Linux kernel? The use case I have is a worker thread sending UDP test traffic on (up to) 1000 sockets, and receiving the UDP replies in another worker thread. The receiver will be an epoll loop that also receives hardware send and receive timestamps on the socket error queue.
To clarify, when doing a sendmsg() syscall, will this temporarily block (or generate EAGAIN/EWOULDBLOCK) on the receiver thread receiving on the same socket? (i.e. if the send and receive happen to overlap in time) All sockets are set to non-blocking mode.
Another question is granularity of locking in the kernel - if I send and receive with sendmmsg/recvmmsg, is a lock for that socket locked once per sendmmsg, or once per UDP datagram in the sendmmsg?
UPDATE: I took a look at the original patch for sendmmsg in the Linux kernel; it seems the main benefit is avoiding multiple user/kernel-space transitions. If any locking is done, it is probably done inside the individual calls to __sys_sendmsg (see the sketch after the link):
https://lwn.net/Articles/441169/
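For reference, the batched call looks roughly like this. A Linux-specific sketch: send_batch and the payloads are made up, and it relies on g++ defining _GNU_SOURCE (which sendmmsg requires) by default:

```cpp
#include <netinet/in.h>
#include <sys/socket.h>

// Send two UDP datagrams to `dest` with a single user/kernel transition.
bool send_batch(int fd, const sockaddr_in& dest) {
    char a[] = "first";
    char b[] = "second";
    iovec iov[2] = { { a, sizeof a }, { b, sizeof b } };

    mmsghdr msgs[2] = {};
    for (int i = 0; i < 2; ++i) {
        msgs[i].msg_hdr.msg_name    = const_cast<sockaddr_in*>(&dest);
        msgs[i].msg_hdr.msg_namelen = sizeof dest;
        msgs[i].msg_hdr.msg_iov     = &iov[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
    }

    // One syscall, two datagrams; returns how many were actually sent.
    return sendmmsg(fd, msgs, 2, 0) == 2;
}
```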
Each system call is thread-independent. So, as long as you don't involve per-process kernel data, both will run independently without disturbing each other.
A different matter is what the kernel does with system calls related to the same inode (in this case, the virtual node assigned to the socket you use to communicate). To serialize calls to the filesystem and make them atomic, the kernel normally holds an inode lock for the duration of the whole system call (be it a read, write or ioctl system call); even if you do a single write call to write a zillion bytes, the inode stays locked during the execution of the whole system call.
In the TCP/IP stack, this is done at the socket level, and in your case it is controlled by the specific AF_INET socket class software. As far as UDP is concerned, sending or receiving a packet doesn't touch shared resources that need to be locked, but you'll have to look at your UDP implementation (or the socket level) to see whether some locking is done and what its granularity is. Normally such a lock, if there is one, would be held only while loading/unloading the UDP buffers (normally there aren't buffers in UDP itself, as the socket and the network card driver supply enough buffer resources).
I'm trying to get to grips with boost asio but I'm having trouble understanding some of the behavior behind the asynchronous interface.
I have a simple setup with a client and a server.
- The client calls async_write regularly with a fixed amount of data
- The server polls for data regularly
What happens when the server stops polling for data?
I guess the various buffers would fill up in the server's OS and it would stop sending ACKs?
Regardless of what happens, it seems that the client can happily continue to send several gigabytes of data without receiving any error callback (it doesn't receive any success callback either, of course).
I assume the client OS stops accepting packets at some point, since they can't be TX'ed?
Does this mean that boost::asio buffers data internally?
If it does, can I use socket.cancel() to drop packets in case I don't want to wait for delivery? (I need to make sure Asio forgets about my packets so I can reuse old buffers for new packets.)
Asio doesn't buffer internally, and you will always get signaled when you can't transfer more data to the remote.
E.g. if you use synchronous writes in Asio, they will block until the data could be sent (or at least copied into the kernel send buffers). If you use async writes, the callback/acknowledgement will only be invoked once the data could be sent. If you use non-blocking writes, you get EAGAIN/WOULD_BLOCK errors. If you use multiple async_write calls in parallel - well - you shouldn't do that; its behavior is undefined according to the Asio docs:
This operation is implemented in terms of zero or more calls to the stream's async_write_some function, and is known as a composed operation. The program must ensure that the stream performs no other write operations (such as async_write, the stream's async_write_some function, or any other composed operations that perform writes) until this operation completes.
Guarantee in your application that you only ever have a single async write operation in flight, and once that finishes, write the next piece of data. If you need to write data in between, you will have to buffer it inside your application, as sketched below.
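A common way to do that buffering (a sketch; the Session class and its members are made up for illustration) is an outgoing queue with at most one async_write in flight:

```cpp
#include <deque>
#include <string>
#include <boost/asio.hpp>

class Session {
public:
    explicit Session(boost::asio::ip::tcp::socket socket)
        : socket_(std::move(socket)) {}

    // Queue a message; start an async_write only if none is in flight.
    void send(std::string msg) {
        bool idle = queue_.empty();
        queue_.push_back(std::move(msg));
        if (idle) write_next();
    }

private:
    void write_next() {
        boost::asio::async_write(
            socket_, boost::asio::buffer(queue_.front()),
            [this](const boost::system::error_code& ec, std::size_t) {
                if (ec) return;                     // log/close as appropriate
                queue_.pop_front();
                if (!queue_.empty()) write_next();  // chain the next write
            });
    }

    boost::asio::ip::tcp::socket socket_;
    std::deque<std::string> queue_;  // pending messages; front is in flight
};
```

With a single thread running the io_context (or a strand), this needs no locking, and it satisfies the one-composed-write-at-a-time rule from the quote above. Production code would typically also use std::enable_shared_from_this and capture a shared_ptr in the lambda so the session outlives the pending operation.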
I'm a beginner in boost::asio.
I need to code a module which reads from a pipe and puts the data into a ring buffer (I've no problem in how to implement this part).
Another part of the module waits for a consumer to open a new TCP connection or unix domain socket, and when the connection is made it sends the full ring buffer contents and then sends any new data as soon as it is pushed into the ring buffer. Multiple consumers are allowed, and a consumer can open a new connection at any time.
The first naive implementation I thought of is to keep a separate asio::streambuf for every connection, push the entire ring buffer into it on connection, and then push every new piece of data. But this seems a very sub-optimal method, both in memory and in CPU cycles, as data has to be copied for every connection - maybe multiple times, as I don't know whether boost::asio::send (or the Linux TCP/IP stack) makes a copy of the data.
As my idea is to use no multithreading at all, I'm thinking of using some form of custom asio::streambuf-derived class which shares the actual buffer with the ring buffer but keeps a separate read-pointer state, without the need for any lock.
Mine seems to be a pretty unusual need, because I'm unable to find any related documentation/questions dealing with a similar subject, and the Boost documentation seems pretty brief and scarce to me (see e.g.: http://www.boost.org/doc/libs/1_57_0/doc/html/boost_asio/reference/basic_streambuf.html).
It would be nice if someone could point me to some ideas I could take as a starting point for my design, or point me to an alternative design if they consider mine bad, un-implementable and/or improvable.
You should just do what you intend to.
You absolutely don't need a streambuf to use with Boost Asio: http://www.boost.org/doc/libs/release/doc/html/boost_asio/reference/buffer.html
If the problem is how to avoid having the producer "wait" until all consumers (read: connections) are done transmitting the data, you can always use ye olde trick of alternating output buffers.
Many ring buffer implementations allow direct splicing of a complete sequence of elements at once, (e.g. boost lockfree spsc_queue cache memory access). You could use such an operation to your advantage.
Also relevant:
TCP Zero copy using boost
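To make the per-connection read pointer idea concrete, here is a sketch. The Ring and Span types and their members are made up, a single thread is assumed to run the io_service, and wrap-around handling is omitted; each connection holds its own offset into the shared ring and hands Asio a buffer pointing straight into the ring's storage, so nothing is copied per connection:

```cpp
#include <cstddef>
#include <vector>
#include <boost/asio.hpp>

// Illustrative span: a contiguous readable region of the ring's storage.
struct Span { const char* data; std::size_t size; };

// Illustrative ring: real implementations must handle wrap-around.
struct Ring {
    std::vector<char> storage;
    std::size_t write_pos = 0;  // bytes written so far (no wrap here)

    Span readable_from(std::size_t pos) const {
        return { storage.data() + pos, write_pos - pos };
    }
};

struct Consumer {
    boost::asio::ip::tcp::socket socket;
    std::size_t read_pos = 0;   // this consumer's private read offset

    // Send whatever this consumer hasn't sent yet, directly from the
    // ring's storage: no per-connection copy is made.
    void send_available(Ring& ring) {
        Span span = ring.readable_from(read_pos);
        if (span.size == 0) return;  // nothing new yet
        boost::asio::async_write(
            socket, boost::asio::buffer(span.data, span.size),
            [this, &ring](const boost::system::error_code& ec, std::size_t n) {
                if (ec) return;
                read_pos += n;
                send_available(ring);  // continue if more data arrived
            });
    }
};
```

The catch: the producer must never overwrite a region some consumer is still sending from, so the ring must either be sized generously or track the slowest consumer's offset.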
It appears that performance is a concern here.
Independent of whether boost::asio or some hand-knitted solution is used, performance (throughput) might already be down the drain because of the fact (stated in the comment section of the OP) that single bytes are being traded (read from the pipe).
After the initial "burst phase" when a consumer connects, single bytes trickle from the pipe to the connected consumer sockets, with a read() and write() operation per byte (or per few bytes, if the application is not constantly polling).
Given that the full price of the read() and write() system calls is paid for tiny amounts of data, I dare theorize that anything about multiple queues versus a single queue etc. is already in the shadow of that basic "design flaw". I put "design flaw" in quotes because it cannot always be avoided to have to handle exactly such a situation.
So, if throughput cannot be optimized anyway, I would recommend the simplest and most straightforward solution that can be conceived.
The "no threads" statement in the OP implies non-blocking file descriptors for both the accept socket, the consumer data sockets and the pipe. Will this be another 100% CPU/core eating polling application? If this is not some kind of special ops hyper-optimized problem, I would rather not advice to use non-blocking file descriptors. Also, I would not worry about zero-copy or not.
One easy approach with threads would be to have the consumer sockets non-blocking, while pipe is in blocking mode. The thread which reads the pipe then pumps the data into a queue and calls the function which services all currently connected consumers. The listen socket (the one calling accept()) is in signaled state, when new client connections are pending. With mechanisms like kqueue (bsd) or epoll (linux etc.) or WaitForMultipleObjects (windows), the pipe reader thread can react to that situation as well.
In the times when nothing is to be done, your application is sleeping/blocking and friendly to our environment :)
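A bare-bones sketch of such an event loop (Linux epoll; listen_fd and pipe_fd are assumed to be created elsewhere, and error handling is trimmed):

```cpp
#include <sys/epoll.h>

// Block until the listen socket or the pipe has work, then dispatch.
void event_loop(int listen_fd, int pipe_fd) {
    int ep = epoll_create1(0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);
    ev.data.fd = pipe_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, pipe_fd, &ev);

    epoll_event events[16];
    for (;;) {
        // Sleeps until something is ready: no busy polling, no CPU burn.
        int n = epoll_wait(ep, events, 16, -1);
        for (int i = 0; i < n; ++i) {
            if (events[i].data.fd == listen_fd) {
                // accept() the new consumer; add its fd to the epoll set
            } else if (events[i].data.fd == pipe_fd) {
                // read() from the pipe; fan the data out to the consumers
            }
        }
    }
}
```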
Many of you know that the plain send() will not necessarily write to the wire the number of bytes you ask it to. You can easily use a pointer and a loop to make sure all your data gets sent.
However, I don't see how WSASend() and completion ports work in this case. It returns immediately and you have no control over how much was sent (except via the lpLength you have access to in the completion routine). How does this get solved?
Do you have to call WSASend() in the completion routine multiple times in order to get all the data out? Doesn't this seem like a great disadvantage, especially if you want your data out in a particular order and multiple threads access the routines?
When you call WSASend with a socket that is associated with an IOCP and an OVERLAPPED structure you effectively pass off your data to the network stack to send. The network stack will give you a "completion" once the data buffer that you used is no longer required by the network stack. At that point you are free to reuse or release the memory used for your data buffer.
Note that the data is unlikely to have reached the peer at the point the completion is generated and the generation of the completion means nothing more than the network stack has taken ownership of the contents of the buffer.
This is different to how send operates. With send in blocking mode the call to send will block until the network stack has used all of the data that you have supplied. For calls to send in non-blocking mode the network stack takes as much data as it can from your buffer and then returns to you with details of how much it used; this means that some of your data has been used. With WSASend, generally, all of your data is used before you are notified.
It's possible for an overlapped WSASend to fail due to resource limits or network errors. It's unusual to get a failure which indicates that some data has been sent but not all; usually it's all sent OK or none sent at all. However, it IS possible to get a completion with an error which indicates that some data has been used but not all. How you proceed from this point depends on the error (temporary resource limit or hard network fault) and on how many other WSASends you have pending on that socket (zero or non-zero). You can only try to send the rest of the data if you have a temporary resource error and no other outstanding WSASend calls for this socket; and this is made more complicated by the fact that you don't know when the temporary resource limit situation will pass... If you ever have a temporary-resource-limit-induced partial send and you DO have other WSASend calls pending, then you should probably abort the connection, as you may have garbled your data stream by sending part of the buffer from this WSASend call and then all (or part) of a subsequent WSASend call.
Note that it's a) useful and b) efficient to have multiple WSASend calls outstanding on a socket; it's the only way to keep the connection fully utilised. You should, however, be aware of the memory and resource usage implications of having multiple overlapped WSASend calls pending at one time (see here), as effectively you are handing control of the lifetime of your buffers (and thus the amount of memory and resources that your code uses) to the peer, due to TCP flow control issues. See SIO_IDEAL_SEND_BACKLOG_QUERY and SIO_IDEAL_SEND_BACKLOG_CHANGE if you want to get really clever...
WSASend() on a completion port does not notify you until all of the requested data has been accepted by the socket, or until an error occurs, whichever happens first. It keeps working in the background until all of the data has been accepted (or errored). Until it notifies you, that buffer has to remain active in memory, but your code is free to move on to do other things while WSASend() is busy. There is no notification when the data is actually transmitted to the peer. IF you need that, then you have to implement an ACK in your data protocol so the peer can notify you when it receives the data.
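To make the ownership hand-off concrete, here is a minimal sketch. It assumes the socket was already associated with the IOCP via CreateIoCompletionPort; the SendOp struct is made up, and error handling is trimmed:

```cpp
#include <winsock2.h>
#include <windows.h>

// The OVERLAPPED and the data buffer must stay alive until the completion
// for this operation is dequeued from the IOCP.
struct SendOp {
    OVERLAPPED ov{};
    WSABUF     buf{};
    char       data[4096];
};

void start_send(SOCKET s, SendOp* op, int len) {
    op->buf.buf = op->data;
    op->buf.len = len;
    int rc = WSASend(s, &op->buf, 1, nullptr, 0, &op->ov, nullptr);
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING) {
        // Immediate failure: the stack never took ownership of the buffer.
    }
    // Otherwise the stack owns op->data until the completion arrives.
}

void completion_loop(HANDLE iocp) {
    DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
        SendOp* op = CONTAINING_RECORD(ov, SendOp, ov);
        // The stack no longer needs op->data; safe to reuse or free it.
        delete op;
    }
}
```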
First, regarding send: two different things may actually happen, depending on how the socket is configured.
If the socket is in so-called blocking mode (the default), the call to send will block the calling thread until all the input buffer is consumed by the underlying network driver. (Note that this doesn't mean the data has already arrived at the peer.)
If the socket has been switched to non-blocking mode, the call to send will fail if the underlying driver cannot consume all the input immediately; WSAGetLastError returns WSAEWOULDBLOCK in such a case. The application should wait until it may retry the send. Instead of calling send in a loop, the application should get a notification from the system about the socket state change. Functions such as WSAEventSelect or WSAAsyncSelect may be used for this (as well as legacy select); a sketch follows.
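A sketch of that notification-based retry loop (the send_all helper is made up; error handling is trimmed):

```cpp
#include <winsock2.h>

// Send a whole buffer on a non-blocking socket, waiting on an event
// whenever the driver's buffers are full.
int send_all(SOCKET s, const char* data, int len) {
    WSAEVENT ev = WSACreateEvent();
    // Also switches the socket to non-blocking mode.
    WSAEventSelect(s, ev, FD_WRITE | FD_CLOSE);

    int sent = 0;
    while (sent < len) {
        int n = send(s, data + sent, len - sent, 0);
        if (n != SOCKET_ERROR) { sent += n; continue; }
        if (WSAGetLastError() != WSAEWOULDBLOCK) { sent = -1; break; }

        // Buffers full: sleep until the socket becomes writable again.
        WSAWaitForMultipleEvents(1, &ev, TRUE, WSA_INFINITE, FALSE);
        WSANETWORKEVENTS ne;
        WSAEnumNetworkEvents(s, ev, &ne);  // reports events and resets ev
    }
    WSACloseEvent(ev);
    return sent;
}
```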
Now, with I/O completion ports and WSASend the story is somewhat different. When the socket is associated with the completion port - it's automatically transferred to a non-blocking mode.
If the call to WSASend can't be completed immediately (i.e. the network driver can't consume all the input), WSASend returns SOCKET_ERROR and WSAGetLastError returns WSA_IO_PENDING. This actually means that an asynchronous operation has started but has not finished yet.
That is, you should not call WSASend repeatedly, because the send operation is already in progress. When it finishes (either successfully or not) you'll get the notification on the I/O completion port, but meanwhile the calling thread is free to do other things.
I have a client-server architecture with 10 servers holding permanent connections with a single client; the software is written in C++ and uses the Boost Asio libraries.
All the connections are created in the initialization phase, and they are always open during the execution.
When the client needs some information, it sends a request to all of the servers. Each server finds the information needed and answers the client.
In the client there is a single thread in charge of receiving the messages from all of the sockets; in particular, I use only one io_service and one async_read on each of the sockets.
When a message arrives on one of the sockets, the async_read reads the first N bytes, which are the header of the message, and then calls a function that uses read (synchronous) to read the rest of the message. On the server side, the header and the rest of the message are sent with a single write (synchronous).
The architecture works properly, but I noticed that sometimes the synchronous read takes more time (~0.24 s) than usual.
In theory the data is ready to be read, because the synchronous read is called when the async_read has already read the header. I also saw that if I use only one server instead of 10, this problem doesn't occur. Furthermore, I noticed that this problem is not caused by the size of the message.
Is it possible that the problem occurs because the io_service is not able to handle all 10 async_reads? In particular, if all the sockets receive a message at the same time, could the io_service lose some time managing the queues, slowing down my synchronous read?
I haven't posted the code because it is difficult to extract it from the project, but if you don't understand my description I could write an example.
Thank you.
1) When an async_read completion handler gets invoked, it doesn't mean that some data is available - it means that all the data available up to that moment has already been read (unless you specified a restricting completion condition). So the subsequent synchronous read might wait until some more data arrives.
2) Blocking inside a completion handler is a bad idea, because you actually block all the other completion handlers and other functors posted to that io_service. Consider changing your design.
If you go for an asynchronous design, don't mix in synchronous parts. Replace all your synchronous reads and writes with asynchronous ones; both synchronous reads and writes will block your thread, while the asynchronous variants will not.
Further, if you know the number of expected bytes exactly after reading the header you should request exactly that number of bytes.
If you don't know it, you could go for a single async_read_some with the size of the biggest message you expect. async_read_some will notify you how many bytes were actually read.
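For the case where the header does encode the exact body length, the fully asynchronous chain could look like this sketch (HEADER_SIZE, the 32-bit big-endian length field, and the Connection class are made up for illustration):

```cpp
#include <array>
#include <cstdint>
#include <vector>
#include <boost/asio.hpp>

class Connection {
public:
    explicit Connection(boost::asio::ip::tcp::socket socket)
        : socket_(std::move(socket)) { read_header(); }

private:
    static constexpr std::size_t HEADER_SIZE = 4;  // made-up fixed header

    // Illustrative: a 32-bit big-endian length field in the header.
    std::size_t body_length_from(const std::array<char, HEADER_SIZE>& h) {
        return (std::uint8_t(h[0]) << 24) | (std::uint8_t(h[1]) << 16) |
               (std::uint8_t(h[2]) << 8)  |  std::uint8_t(h[3]);
    }

    void read_header() {
        boost::asio::async_read(
            socket_, boost::asio::buffer(header_),
            [this](const boost::system::error_code& ec, std::size_t) {
                if (!ec) read_body(body_length_from(header_));
            });
    }

    void read_body(std::size_t length) {
        body_.resize(length);
        boost::asio::async_read(
            socket_, boost::asio::buffer(body_),
            [this](const boost::system::error_code& ec, std::size_t) {
                if (ec) return;
                // Process body_ here, then wait for the next message.
                read_header();
            });
    }

    boost::asio::ip::tcp::socket socket_;
    std::array<char, HEADER_SIZE> header_{};
    std::vector<char> body_;
};
```

Neither async_read ever blocks the thread, so completion handlers for the other nine sockets keep running while a body is in flight.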