When is boost::asio::error::no_buffer_space thrown? - c++

I have an application that leverages boost::asio to accept new connections. The application uses async_accept to accept incoming connections and process them further. However, I notice that on certain instance types I consistently run into boost::asio::error::no_buffer_space while accepting new connections. Under which circumstances is this thrown?
I've tried the following so far:
Increasing the listen backlog for the application. This is confirmed to have taken effect via ss -ntlp, which shows the new backlog size.
Increasing net.ipv4.tcp_max_syn_backlog
Increasing net.core.somaxconn
Nonetheless, I'm consistently getting boost::asio::error::no_buffer_space thrown while trying to accept a burst of connections. Under what additional circumstances would I get this, and how can I possibly mitigate it? Thank you.
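One possible mitigation (sketched below) is to treat the error as transient in the accept handler and retry after a short delay instead of stopping the accept loop. A minimal sketch, assuming Boost.Asio 1.66 or later for the move-accepting async_accept overload; the names and the 50 ms delay are arbitrary:

    // Accept loop that backs off briefly on no_buffer_space instead of giving up.
    #include <boost/asio.hpp>
    #include <chrono>

    using boost::asio::ip::tcp;

    void start_accept(tcp::acceptor& acceptor, boost::asio::steady_timer& retry_timer)
    {
        acceptor.async_accept(
            [&acceptor, &retry_timer](const boost::system::error_code& ec, tcp::socket peer) {
                if (!ec) {
                    // Hand `peer` off to the application's session handling here.
                } else if (ec == boost::asio::error::operation_aborted) {
                    return;                              // acceptor was closed; stop the loop
                } else if (ec == boost::asio::error::no_buffer_space) {
                    // Transient resource exhaustion: back off briefly, then retry.
                    retry_timer.expires_after(std::chrono::milliseconds(50));
                    retry_timer.async_wait([&acceptor, &retry_timer](const boost::system::error_code&) {
                        start_accept(acceptor, retry_timer);
                    });
                    return;
                }
                start_accept(acceptor, retry_timer);     // keep accepting subsequent connections
            });
    }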

Related

What is the proper mechanism for handling TCP failure?

I am writing a socket program in c++. The program runs on a set of cluster machines.
I have just started socket programming and have only learned how to send and receive. I expect that, during a long run of the program, some TCP connections can get lost. In that case, reconnecting the server and client smoothly is necessary.
I wonder if there is a well-known basic mechanism (or algorithm? protocol?) to achieve this. I found that there are many, many socket error codes with different semantics, which makes it hard for me to know where to start.
Can any one suggest any reference code that I can learn from?
Thanks,
It's not complicated. The only two error codes that aren't fatal to the connection are:
EAGAIN/EWOULDBLOCK, which are in fact two names for the same number, and mean that it is OK to re-attempt the operation after a period, or after select()/poll()/epoll() has so indicated;
EINTR, which just means 'interrupted system call' - try again.
All others are fatal to the connection and should cause you to close it.
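In code, that rule looks roughly like this (POSIX, non-blocking socket assumed; the -2 "not ready" value is just a sentinel for this sketch):

    #include <cerrno>
    #include <sys/types.h>
    #include <sys/socket.h>

    // Returns >0 bytes read, 0 on orderly close, -1 when the connection is dead
    // (close it), -2 when the socket is fine but not ready (poll/select and retry).
    ssize_t read_some(int fd, char* buf, size_t len)
    {
        for (;;) {
            ssize_t n = recv(fd, buf, len, 0);
            if (n >= 0)
                return n;                              // data, or 0 == peer closed
            if (errno == EINTR)
                continue;                              // interrupted system call: just retry
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return -2;                             // not ready yet
            return -1;                                 // anything else is fatal to the connection
        }
    }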
The actual, specific error code is irrelevant. If you have an active socket connection, a failed read or write indicates that the connection is gone. The error code perhaps gives you some explanation, but it's a bit too late now. The socket is gone. It is no more. It has ceased to exist. It's an ex-socket. You can use the error code to come up with a colorful explanation, but it would be little more than a minor consolation. Whatever the specific reason was, your socket is gone and you have to deal with it.
When using non-blocking sockets there are certain return codes and errno values that indicate the socket is still fine but simply not ready to read or write anything; you'll have to check for those specifically and handle them. That is one exception to the rule.
Also, EINTR does not necessarily mean that the socket is really broken, so that might be another exception to check for.
Once you have a broken socket, the only general design principle, if there is one, is that you have to close() it as the first order of business. The file descriptor is completely useless. After that point, it's entirely up to you what to do next. There are no rules, etched in stone, for this situation. Typically, applications would log an error, in some form or fashion, or attempt to make another connection. It's generally up to you to figure out what to do.
About the only "well-known basic mechanism" in socket programming is explicit timeouts. Network errors and failures are not always detected immediately by the underlying operating system. It can take many minutes before the protocol stack declares the socket broken and gives you an error indication.
So, if you're coding a particular application, and you know that you should expect to read or write something within some prescribed time frame, a common design pattern is to code an explicit timeout, and if nothing happens when the timeout expires, assume that the socket is broken -- even if you have no explicit error indication otherwise -- close() it, then proceed to the next step.
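A rough sketch of that timeout check using select() (the caller closes the socket and moves on if it returns 0):

    #include <sys/select.h>
    #include <sys/time.h>

    // Returns 1 if the socket became readable, 0 on timeout (treat the
    // connection as broken), -1 on error.
    int wait_readable(int fd, int timeout_sec)
    {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        struct timeval tv;
        tv.tv_sec  = timeout_sec;
        tv.tv_usec = 0;

        return select(fd + 1, &readfds, NULL, NULL, &tv);
    }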

Out of Process COM Server - One server process per calling process?

I have an out of process com server, specifying CLSCTX_LOCAL_SERVER as the context, and REGCLS_MULTIPLEUSE for the connection type. This results in a single server process being reused by multiple calls from multiple clients.
I’m now wanting to make some changes to the server, which unfortunately can not work with a single process shared amongst clients (there are reasons for this, but they're long winded). I know you can set the server to use REGCLS_SINGLEUSE as the connection type, and this will create a new process for the OOP server each call. This solves my issue, but is a non-starter in terms of process usage; multiple calls over short periods result in many processes and this particular server might be hit incredibly often.
Does anyone happen to know of a mechanism to mix those two connection types? Essentially what I want is a single server process per calling process. (I.e., client one creates a process, and that process is reused for subsequent calls from that client; client two tries to call the server, and a new process is created.) I suspect I could achieve it by forcing a REGCLS_SINGLEUSE server to stay open permanently in the client, but this is neither elegant nor possible (since I can’t change one of the clients).
Thoughts?
UPDATE
As expected, it seems there is no way to do this. If time and resource permitted I would most likely convert this to an In-Proc solution. For now though, I'm having to go with the new behaviour being used for any calling client. Fortunately, the impact of this change is incredibly small, and acceptable by the clients. I'll look into more drastic and appropriate changes later.
NOTE
I've marked Hans' reply as the answer, as it does in fact give a solution to the problem which maintains the OOP solution. I merely don't have capacity to implement it.
COM does not support this activation scenario. It is the kind of scenario that is supposed to be covered by an in-process server, so do make sure that isn't the way you want to go, given its rather major advantages.
Using REGCLS_SINGLEUSE is the alternative, but this requires extending your object model to avoid the storm of server instances you would now create. The Application coclass is the boilerplate approach: provide it with factory methods that give you instances of your existing interfaces.
I'll mention a drastically different approach, one I used when I wanted to solve the same problem but required an out-of-process server to bridge a bitness gap. You are not stuck with COM launching the server process for you; a client can start it as well, provided of course that it knows enough about the server's installation location. Now a client has complete control over the server instance. The server called CoRegisterClassObject() with an altered CLSID: I XORed part of the GUID with the process ID. The client did the same, so it always connected to the correct server. Extra code was required in the client to ensure it waits long enough to give the server a chance to register its object factories. Worked well.
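Roughly, the CLSID trick looks like this (not the original code; the helper name and the choice of GUID field are illustrative):

    #include <windows.h>

    // Derive a per-process CLSID by folding the server's process id into the
    // base GUID, so each client/server pair rendezvous on a unique class object.
    CLSID MakePerProcessClsid(CLSID base, DWORD pid)
    {
        base.Data1 ^= pid;   // XOR part of the GUID with the process id
        return base;
    }

    // Server (sketch): CoRegisterClassObject(MakePerProcessClsid(CLSID_MyServer, GetCurrentProcessId()), ...);
    // Client (sketch): launch the server, then CoCreateInstance() with the same
    // derived CLSID, computed from the process id returned by CreateProcess().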

What does it mean for TCP connections to churn?

In the context of web services, I've seen the term "TCP connection churn" used. Specifically, Twitter's Finagle has ways to avoid it happening. How does it happen? What does it mean?
There might be multiple uses for this term, but I've always seen it used in cases where many TCP connections are being made in a very short space of time, causing performance issues on the client and potentially the server as well.
This often occurs when client code is written to automatically reconnect on a TCP failure of any sort. If this failure happens to be a connection failure before the connection is even made (or very early in the protocol exchange), then the client can go into a near-busy loop constantly making connections. This can cause performance issues on the client side - firstly, there is a process in a very busy loop sucking up CPU cycles, and secondly, each connection attempt consumes a client-side port number - if this happens fast enough, the port numbers can wrap around when they hit the maximum (as a port is only a 16-bit number, this certainly isn't impossible).
While writing robust code is a worthy aim, this simple "automatic retry" approach is a little too naive. You can see similar problems in other contexts - e.g. a parent process continually restarting a child process which immediately crashes. One common mechanism to avoid it is some sort of increasing back-off. So, when the first connection fails you immediately reconnect. If it fails again within a short time (e.g. 30 seconds) then you wait, say, 2 seconds before reconnecting. If it fails again within 30 seconds, you wait 4 seconds, and so on. Read the Wikipedia article on exponential backoff (or this blog post might be more appropriate for this application) for more background on this technique.
This approach has the advantage that it doesn't overwhelm the client or server, but it also means the client can still recover without manual intervention (which is especially crucial for software on an unattended server, for example, or in large clusters).
In cases where recovery time is critical, simple rate-limiting of TCP connection creation is also quite possible - perhaps no more than one per second or so. If there are many clients per server, however, this more simplistic approach can still leave the server swamped by the load of accepting and then closing a high rate of connections.
One thing to note if you plan to employ exponential backoff - I suggest imposing a maximum wait time, or you might find that prolonged failures leave a client taking too long to recover once the server end does start accepting connections again. I would suggest something like 5 minutes as a reasonable maximum in most circumstances, but of course it depends on the application.
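Putting the capped backoff together, roughly (the try_connect callback is a stand-in for whatever connection routine the client actually uses):

    #include <algorithm>
    #include <chrono>
    #include <functional>
    #include <thread>

    // try_connect returns true on success, false on failure.
    void connect_with_backoff(const std::function<bool()>& try_connect)
    {
        using namespace std::chrono;
        seconds delay(2);                      // first retry after 2 s
        const seconds max_delay(300);          // cap at ~5 minutes, as suggested above

        while (!try_connect()) {
            std::this_thread::sleep_for(delay);
            delay = std::min(delay * 2, max_delay);
        }
    }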

Writing a server application that Pushes to clients (TCP)

I'm writing a client-server application and one of the requirements is the Server, upon receiving an update from one of the clients, be able to Push out new data to all the other clients. This is a C++ (Qt) application meant to run on Linux (both client and server), but I'm more looking for high-level conceptual ideas of how this should work (though low-level thoughts are good, too).
Server:
It needs to (among its other duties) keep a socket open listening for incoming packets from potentially n different clients, presumably on a background thread (I haven't written much in terms of socket code other than some rinky-dink examples in school). Upon getting this data from a client, it processes it and then spits it out to all its clients, right?
Of course, I'm not sure how it actually does this. I'm guessing this means it has to keep persistent connections with every single client (at least the active clients), but I don't understand even conceptually how to maintain this connection (or the list of these connections).
So, how should I approach this?
In general when you have multiple clients, there are a few ways to handle this.
First of all, in TCP, when clients connect to you they're placed into a queue until they can be serviced. This is a given; you don't need to do anything except call the accept system call to receive a new client. Once the client is received, you'll be given a socket which you use to read and write. Who reads/writes first is entirely dependent on your protocol, but both sides need to know the protocol (which is up to you to define).
Once you've got the socket, you can do a few things. In a simple case, you just read some data, process it, write back to the socket, close the socket, and serve the next client. Unfortunately this means you can only serve one client at a time, thus no "push" updates are possible. Another strategy is to keep a list of all the open sockets. Any "updates" simply iterate over the list and write to each socket. This may present a problem though because it only allows push updates (if a client sent a request, who would be watching for it?)
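That list-of-sockets approach looks roughly like this (POSIX, blocking sockets, with error handling reduced to dropping the client):

    #include <sys/socket.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    std::vector<int> clients;   // file descriptors returned by accept()

    void broadcast(const std::string& msg)
    {
        for (size_t i = 0; i < clients.size(); ) {
            ssize_t n = send(clients[i], msg.data(), msg.size(), 0);
            if (n < 0) {                          // write failed: assume the client is gone
                close(clients[i]);
                clients.erase(clients.begin() + i);
            } else {
                ++i;
            }
        }
    }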
The more advanced approach is to assign one thread to each socket. In this scenario, each time a socket is created, you spin up a new thread whose whole purpose is to serve exactly one client. This cuts down on latency and utilizes multiple cores (if available), but is far more difficult to program. Also if you have 10,000 clients connecting, that's 10,000 threads which gets to be too much. Pushing an update to a single client (in this scenario) is very simple (a thread just writes to its respective socket). Pushing to all of them at once is a little more tricky (requires either a thread event or a producer / consumer queue, neither of which are very fun to implement)
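And the thread-per-socket layout, roughly (POSIX sockets plus std::thread, with the actual request handling elided):

    #include <sys/socket.h>
    #include <unistd.h>
    #include <thread>

    void serve_client(int fd)
    {
        char buf[4096];
        ssize_t n;
        while ((n = recv(fd, buf, sizeof buf, 0)) > 0) {
            // parse the request in buf[0..n) and send() replies as needed
        }
        close(fd);                                 // 0 or <0 from recv(): client is done
    }

    void accept_loop(int listen_fd)
    {
        for (;;) {
            int fd = accept(listen_fd, NULL, NULL);
            if (fd < 0)
                continue;
            std::thread(serve_client, fd).detach(); // one thread per connection
        }
    }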
There are, of course, a million other ways to handle this (one process per client, a thread pool, a load-balancing proxy, you name it). Suffice it to say there's no way to cover all of these in one answer. I hope this answers your basic questions, let me know if you need me to clarify anything. It's a very large subject. However if I might make a suggestion, handling multiple clients is a wheel that has been re-invented a million times. There are very good libraries out there that are far more efficient and programmer-friendly than raw socket IO. I suggest libevent, which turns network requests into an event-driven paradigm (much more like GUI programming, which might be nice for you), and is incredibly efficient.
From what I understand, you need to keep an infinite loop going (at least until the program terminates) that answers connection requests from your clients. It would be best to add them to an array of some sort. Use an event to see when a new client is added to that array, and wait for one of them to send data. Then you do what you have to do with that data and spit it back out.

How do I determine a maximum time needed for TCP socket to die due to intermediate network disconnect?

I have a program in C++, using the standard socket API, running on Ubuntu 7.04, that holds open a socket to a server. My system lives behind a router. I want to figure out how long it could take to get a socket error once my program starts sending AFTER the router is cut off from the net.
That is, my program may go idle (waiting for the user). The router is disconnected from the internet, and then my program tries to communicate over that socket.
Obviously it's not going to know quickly, because TCP is quite adept at keeping a socket alive under adverse network conditions. This causes TCP to retry a lot of times, a lot of ways, before it finally gives up.
I need to establish some kind of 'worst case' time that I can give to the QA group (and the customer), so that they can test that my code goes into a proper offline state.
(for reference, my program is part of a pay at pump system for gas stations, and the server is the system that authorizes payment transactions. It's entirely possible for the station to be cut off from the net for a variety of reasons, and the customer just wants to know what to expect).
EDIT: I wasn't clear. There's no human being waiting on this thing, this is just for a back office notation of system offline. When the auth doesn't come back in 30 seconds, the transaction is over and the people are going off to do other things.
EDIT: I've come to the conclusion that the question isn't really answerable in the general case. The number of factors involved in determining how long a TCP connection takes to error out due to a downstream failure is too dependent on the exact equipment and failure for there to be a simple answer.
You should be able to use:
http://linux.die.net/man/2/getsockopt
with:
SO_RCVTIMEO and SO_SNDTIMEO
to determine the timeouts involved.
This link: http://linux.die.net/man/7/socket
talks about more options that may be of interest to you.
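For example, reading the current receive timeout looks like this (a zeroed timeval means no timeout is set, which is the usual default):

    #include <sys/socket.h>
    #include <sys/time.h>

    int get_recv_timeout(int fd, struct timeval* tv)
    {
        socklen_t len = sizeof *tv;
        return getsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, tv, &len);
    }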
In my experience, just picking a time is usually a bad idea. Even when it sounds reasonable, arbitrary timeouts usually misbehave in practice. The general result is that the application becomes unusable when the environment falls outside of the norm.
Especially for financial transactions, this should be avoided. Perhaps providing a cancel button and some indication that the transaction is taking longer than expected would be a better solution.
I would twist the question around the other way: how long is a till operator prepared to stand there looking stupid in front of the customer before they say, "oh, it must not be working, let's do this the manual way"?
So pick some time, like 1 minute (assuming your network link is not the auto-disconnect kind that only reconnects when traffic occurs).
Then use that time for how long your program waits before giving up: close the socket, etc., and display an error message. Maybe even show a countdown timer while waiting, so the till operator has an idea of how much longer the system is going to wait...
Then they know the transaction failed, and that it's manual time.
Otherwise, depending on your IP stack, the worst-case timeout could be "never times out".
I think the best approach is not to try and determine the timeout being used, but to actually specify the timeout yourself.
Depending on your OS you can:
use setsockopt() with option SO_SNDTIMEO (see the sketch after this list),
use non-blocking send() and then use select() with a timeout
use non-blocking send(), and have a timeout on receiving the expected data.
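For instance, the first option imposes your own send timeout; after this, a blocking send() that cannot make progress within the given time fails with EAGAIN/EWOULDBLOCK:

    #include <sys/socket.h>
    #include <sys/time.h>

    int set_send_timeout(int fd, long seconds)
    {
        struct timeval tv;
        tv.tv_sec  = seconds;
        tv.tv_usec = 0;
        return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
    }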