TCP server hangs, not responding to SYN - C++

I have a strange issue with a TCP server that sometimes hangs. The weird part is that when it hangs it does not accept any new connections, i.e. it doesn't even respond to the initial TCP SYN packet. I was fairly sure that, since TCP handshakes are handled by the kernel, clients should at the very least receive the SYN,ACK even when the program hangs. If anyone knows of a situation where a program can hang in a way that prevents the OS from even completing the TCP handshake (without ever closing the listening socket), please let me know.
P.S.
The program is written in C++ and the OS is Windows Server 2016.

Most likely, the listen queue is full. Not responding to the initial SYN causes the other side to try another SYN a bit later. With luck, the listen queue won't be full at that time. The program is probably not calling accept (or some similar function) often enough.
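For reference, here is a minimal sketch of an accept loop that keeps the backlog drained. This is my own illustration rather than the asker's code; handleClient() is a hypothetical hand-off function.

#include <winsock2.h>   // link with ws2_32.lib

void handleClient(SOCKET client);   // hypothetical: hand the connection off elsewhere

// Assumes WSAStartup has succeeded and listenFd is already bound to the port.
void acceptLoop(SOCKET listenFd) {
    listen(listenFd, SOMAXCONN);   // ask for a generous backlog; the default can be small

    for (;;) {
        sockaddr_in peer{};
        int len = sizeof(peer);
        SOCKET client = accept(listenFd, reinterpret_cast<sockaddr*>(&peer), &len);
        if (client == INVALID_SOCKET) {
            // Log WSAGetLastError() and keep going; a loop that bails out here is
            // one way a server ends up with a permanently full backlog.
            continue;
        }
        // Hand off quickly (thread, IOCP, work queue, ...) so this loop gets back
        // to accept() before the backlog fills up.
        handleClient(client);
    }
}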
It's also possible that the program is using the conditional-accept functionality (see the lpfnCondition parameter to WSAAccept) to choose not to respond to this connection attempt.
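For completeness, conditional accept looks roughly like the sketch below. This is only an illustration of the mechanism, not code from the program in question; the SO_CONDITIONAL_ACCEPT option and the g_acceptingPaused policy flag are my assumptions. With that option set, the stack is not supposed to complete the handshake until the condition callback approves the connection, which from the outside looks exactly like not answering the SYN.

#include <winsock2.h>   // link with ws2_32.lib

static bool g_acceptingPaused = false;   // hypothetical policy flag

int CALLBACK acceptCondition(LPWSABUF /*callerId*/, LPWSABUF /*callerData*/,
                             LPQOS /*sqos*/, LPQOS /*gqos*/,
                             LPWSABUF /*calleeId*/, LPWSABUF /*calleeData*/,
                             GROUP* /*group*/, DWORD_PTR /*context*/) {
    // Returning CF_REJECT (or CF_DEFER) here means the connection attempt is
    // never completed as far as the client can tell.
    return g_acceptingPaused ? CF_REJECT : CF_ACCEPT;
}

void conditionalAcceptOnce(SOCKET listenFd) {
    BOOL on = TRUE;
    setsockopt(listenFd, SOL_SOCKET, SO_CONDITIONAL_ACCEPT,
               reinterpret_cast<const char*>(&on), sizeof(on));
    listen(listenFd, SOMAXCONN);

    sockaddr_in peer{};
    int len = sizeof(peer);
    SOCKET client = WSAAccept(listenFd, reinterpret_cast<sockaddr*>(&peer), &len,
                              acceptCondition, 0);
    if (client == INVALID_SOCKET) {
        // WSAGetLastError() distinguishes a rejected connection from a real failure.
    }
}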

Related

Windows sockets: What can cause send() to stockpile silently?

I have a C++ program which uses Windows sockets to communicate in the LAN with another instance of the same program. This works in general. For a few customers it doesn't.
Socket creation, listen and accept succeed. The computer that created the socket can send() a message and the second computer will recv() it. However, when it sends a reply (using send(), successfully), the reply never arrives at the first computer.
If I artificially increase the message size, send() reports that it sent the first 65564(!) bytes successfully, but when I call it again for the rest, it returns -1 and WSAGetLastError returns 183 (ERROR_ALREADY_EXISTS). Again, nothing arrives at the first computer. Firewalls / AV programs are already turned off.
It seems to me as if send() is keeping the data in some kind of "outbox" without really sending it. What could cause such a thing?
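For anyone debugging a case like this, a defensive send loop that tolerates partial sends and reads WSAGetLastError() immediately after the failing call looks roughly like the sketch below. sendAll() is my illustration, not the asker's code.

#include <winsock2.h>   // link with ws2_32.lib

// Returns false on a genuine socket error; keeps sending until all bytes are out.
bool sendAll(SOCKET s, const char* data, int len) {
    int sent = 0;
    while (sent < len) {
        int n = send(s, data + sent, len - sent, 0);
        if (n == SOCKET_ERROR) {
            int err = WSAGetLastError();   // read this before calling any other API
            if (err == WSAEWOULDBLOCK) {
                // Non-blocking socket and the send buffer is full: wait for
                // writability (select()/WSAPoll()) and retry.
                continue;
            }
            return false;   // real failure: connection reset, aborted, etc.
        }
        sent += n;          // send() may accept fewer bytes than requested
    }
    return true;
}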

Forced server-side socket close without SO_LINGER > 0 can lose data, right?

I'm writing a cross-platform client application that uses sockets, written in C++. I'm having problems where the server is doing a hard close on the socket when it's done sending me info.
I've been reading other posts on this topic, and I'm not so much interested in the rights or wrongs of this approach, but it seems the server is either explicitly setting SO_LINGER=0, or that's the default behavior on that system (not sure, it's a Linux box).
I can see (in Wireshark) that the data was sent to me, followed within milliseconds by an RST, indicating a hard close by the server. I personally don't agree with this approach, as it should be up to the client to shut down the socket.
The server team says there's nothing wrong with that approach (doing a hard close rather than a shutdown); it's typical on servers to avoid accumulating TIME_WAIT sockets. On Windows my select() returns indicating there's something to read (while I haven't read any of this "in transit" data yet).
However, because of the quick arrival of the RST, on Windows recv() returns -1 and I'm seeing a 10054 for the error code (connection reset by peer). This wouldn't be too bad if I could at least get the data that was sent, but it seems that once my client's socket stack sees the RST any unread bytes are no longer made available to me.
On Linux (client), there's no problem. It seems the TCP stack is behaving slightly differently, in that I can read the outstanding bytes before the RST is honoured. I'm having trouble convincing the server guys they have a bug, given that it works for a Linux client.
First off, am I correct? Is this a server-side issue? I can't see that the client end is doing anything wrong, so it must be right?
It seems the server team are adamant that they want to perform the close, and they don't want to have TIME_WAIT sockets, so I was going to push for them to add an SO_LINGER of, say, 2 seconds. Does that sound like it will solve my problem? From what I understand this will stop the server from sending out an RST so soon after sending data, and should give me a chance to read the outstanding bytes.
Found a definitive answer to my own question:
"...Upon reception of RST segment, the receiving side will immediately abort the connection. This statement has more implications than just meaning that you will not be able to receive or send any more data to/from this connection. It also implies that any unread data still in the TCP reception buffer will be lost..." It cites the book "TCP/IP Internetworking Volume II". I don't have that book, so I can only take his word for it. Doesn't seems to discard data on Linux, only Windows...
Olivier Langlois's blog
The side-effect of fiddling with SO_LINGER to force a reset is that all pending data is lost. The fact that you don't receive it is all the proof you need that the server team is wrong to do this.
RFC 793 cited below says 'this command [ABORT] causes all pending SENDs and RECEIVEs to be aborted, ... and a special RESET message to be sent to the TCP on the other side of the connection.' See also W.R. Stevens, TCP/IP Illustrated, Vol. 1, p. 287: 'Aborting a connection provides two features to the application: (1) any queued data is thrown away and the reset is sent immediately, and (2) the receiver of the RST can tell that the other end did an abort instead of a normal close'. There is similar wording, along with an extract from the BSD code that implements it, in Vol. 2.
The TIME_WAIT state only occurs on a socket which sends a FIN before it has received one: see RFC 793. So the server should be waiting for a FIN from the client, with a suitable timeout, rather than resetting. This will also permit the client to do connection pooling.
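To make the two behaviours concrete, here is a sketch of a hard close versus a graceful close on the server side (my illustration; the function names are not from the question):

#include <winsock2.h>   // link with ws2_32.lib

// Hard close: linger enabled with a zero timeout makes closesocket() send a RST,
// and data the peer has not yet read may be thrown away on its side.
void hardClose(SOCKET s) {
    linger lg{};
    lg.l_onoff  = 1;
    lg.l_linger = 0;
    setsockopt(s, SOL_SOCKET, SO_LINGER,
               reinterpret_cast<const char*>(&lg), sizeof(lg));
    closesocket(s);                 // RST, no FIN
}

// Graceful close: send a FIN, then drain until the peer closes its side.
// Whichever side sends its FIN first is the one that ends up in TIME_WAIT,
// which is why the answer above suggests letting the client close first.
void gracefulClose(SOCKET s) {
    shutdown(s, SD_SEND);           // "I'm done sending"
    char buf[512];
    while (recv(s, buf, sizeof(buf), 0) > 0) {
        // discard (or process) whatever the peer still had in flight
    }
    closesocket(s);
}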

SIGPIPE not being generated immediately after 1st send

I want to know whether it's possible for a TCP socket to report a broken pipe error immediately. Currently I am catching the SIGPIPE signal on the client side when the server goes down, but I found that the SIGPIPE signal is generated only after the second message is sent from client to server. What could be the possible reason for this? If the other socket end went down, then the first send should trigger SIGPIPE. Why isn't that signal generated immediately?
Is there any possible explanation for this peculiar behaviour? And any possible way to get around it?
The TCP stack will only report an error after some number of retransmission attempts. IIRC, the TCP retransmission timer is initialized to some small number of seconds and the number of retransmissions is typically 5-10. The protocol does not support any other means of detecting a peer that has become unreachable during a data exchange (i.e. someone tripped over the server power cable).
I think using the SO_KEEPALIVE option may speed up broken-link detection.
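As a sketch of that suggestion (POSIX-flavoured, since the question is about SIGPIPE): enable keepalive and, on Linux, tighten its timing. The specific values below are arbitrary examples, not recommendations.

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

void enableKeepAlive(int fd) {
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));

    // Linux-specific knobs; without them the default first probe is very late (~2 hours).
    int idle     = 30;   // seconds of idleness before the first probe
    int interval = 5;    // seconds between probes
    int count    = 3;    // failed probes before the connection is declared dead
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,     sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &count,    sizeof(count));
}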
I want to know whether it's possible for a TCP socket to report a broken pipe error immediately
The other end of the pipe is across a network. That network could be slow and unreliable. So one end of the pipe can never instantly tell whether its partner is still there. The delay could be quite long, so the OS is also likely to do some buffering. These considerations make it practically impossible to immediately detect a broken pipe.
And any possible way to get around it
But why would you want to? The pipe could be broken at any time during transmission, so you have to handle the general case anyway.
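A common way to handle that general case is to stop relying on the signal altogether: ignore SIGPIPE (or pass MSG_NOSIGNAL on Linux) and treat a failing send() as the notification, whenever it arrives. A rough sketch (the function names are mine):

#include <csignal>
#include <cerrno>
#include <sys/types.h>
#include <sys/socket.h>

void ignoreSigpipe() {
    std::signal(SIGPIPE, SIG_IGN);   // send() now fails with EPIPE instead of raising SIGPIPE
}

// Returns false once the connection is known to be broken.
bool trySend(int fd, const char* buf, size_t len) {
    ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);   // MSG_NOSIGNAL is Linux-specific
    if (n < 0 && (errno == EPIPE || errno == ECONNRESET)) {
        return false;   // peer is gone; close the socket and reconnect as needed
    }
    return n == static_cast<ssize_t>(len);   // partial-send handling omitted for brevity
}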

Socket still listening after application crash

I'm having a problem with one of my C++ applications on Windows 2008x64 (same app runs just fine on Windows 2003x64).
After a crash, or even sometimes after a regular shutdown/restart cycle, it has a problem using the socket on port 82 that it needs in order to receive commands.
Looking at netstat I see the socket is still in listening state more than 10 minutes after the application stopped (the process is definitely not running anymore).
TCP 0.0.0.0:82 LISTENING
I tried setting the socket option to REUSEADDR but as far as I know that only affects re-connecting to a port that's in TIME_WAIT state. Either way this change didn't seem to make any difference.
int doReuse = 1;
setsockopt(listenFd, SOL_SOCKET, SO_REUSEADDR,
           (const char *)&doReuse, sizeof(doReuse));
Any ideas what I can do to solve or at least avoid this problem?
EDIT:
Did netstat -an but this is all I am getting:
TCP 0.0.0.0:82 0.0.0.0:0 LISTENING
For netstat -anb I get:
TCP 0.0.0.0:82 0.0.0.0:0 LISTENING
[System]
I'm aware of shutting down gracefully, but even if the app crashes for some reason I still need to be able to restart it. The application in question uses an in-house library that internally uses the Windows Sockets API.
EDIT:
Apparently there is no solution for this problem, so for development I will go with a proxy / tool to work around it. Thanks for all the suggestions, much appreciated.
If this is only hurting you at debug time, use TCPView from the Sysinternals folks to force the socket closed. I am assuming it works on your platform, but I am not sure.
If you're doing blocking operations on any sockets, do not use an indefinite timeout. In my experience this can cause weird behavior on a multiprocessor machine. I'm not sure which Windows server OS it was, but it was one or two versions previous to 2003 Server.
Instead of an indefinite timeout, use a 30 to 60 second timeout and then just repeat the wait. This goes for overlapped I/O and I/O completion ports as well, if you're using them.
If this is an app you're shipping for others to use, good luck. Windows can be a pure bastard when using sockets...
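The bounded-wait pattern described in the previous answer could look roughly like this for a plain select()-based wait (a sketch; the 30-second figure is just the value mentioned above):

#include <winsock2.h>   // link with ws2_32.lib

// Waits until the socket is readable, waking up periodically so the caller's
// shutdown flag (or other health checks) can be honoured.
bool waitReadable(SOCKET s, volatile bool& keepRunning) {
    while (keepRunning) {
        fd_set readSet;
        FD_ZERO(&readSet);
        FD_SET(s, &readSet);

        timeval timeout{};
        timeout.tv_sec = 30;   // bounded wait instead of blocking forever

        int rc = select(0, &readSet, nullptr, nullptr, &timeout);   // nfds is ignored on Windows
        if (rc > 0) return true;               // readable
        if (rc == SOCKET_ERROR) return false;  // inspect WSAGetLastError()
        // rc == 0: timed out; loop, re-check keepRunning, and wait again
    }
    return false;
}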
I tried setting the socket option to REUSEADDR but as far as I know that only affects re-connecting to a port that's in TIME_WAIT state.
That's not quite correct. It will let you re-use a port in TIME_WAIT state for any purpose, i.e. listen or connect. But I agree it won't help with this. I'm surprised by the comment about the OS taking 10 minutes to detect the crashed listener. It should clean up all resources as soon as the process ends, other than ports in the TIME_WAIT state.
The first thing to check is that it really is your application listening on that port. Use:
netstat -anb
to figure out which process is listening on that port.
The second thing to check is that you are closing the socket gracefully when your application shuts down. If you're using a high-level socket API that shouldn't be too much of an issue (you are using a socket API, right?).
Finally, how is your application structured? Is it threaded? Does it launch other processes? How do you know that your application is really shut down?
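On the graceful-close point, an explicit teardown path might look like the sketch below (illustrative only; it assumes a matching WSAStartup at program start):

#include <winsock2.h>   // link with ws2_32.lib

// Call this on every exit path that is under your control; a listening socket
// left to the crash handler is exactly the situation described above.
void closeListener(SOCKET& listenFd) {
    if (listenFd != INVALID_SOCKET) {
        closesocket(listenFd);        // a listening socket has nothing to shutdown()
        listenFd = INVALID_SOCKET;
    }
    WSACleanup();                     // pairs with the WSAStartup done at startup
}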
Run
netstat -ano
This will give you the PID of the process that has the port open. Check that process in Task Manager. Make sure "Show processes from all users" is checked.
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html is a great resource for "Bind: Address Already in Use" errors.
Some extracts:
TIME_WAIT is the state that typically ties up the port for several minutes after the process has completed. The length of the associated timeout varies on different operating systems, and may be dynamic on some operating systems, however typical values are in the range of one to four minutes.
Strategies for Avoidance
SO_REUSEADDR
This is both the simplest and the most effective option for reducing the "address already in use" error.
Client Closes First
TIME_WAIT can be avoided if the remote end initiates the closure. So the server can avoid problems by letting the client close first.
Reduce Timeout
If (for whatever reason) neither of these options works for you, it may also be possible to shorten the timeout associated with TIME_WAIT.
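As a sketch of the SO_REUSEADDR strategy: the detail that is easy to miss is that the option has to be set before bind(). The function below is my own illustration, not code from the linked page.

#include <winsock2.h>   // link with ws2_32.lib

SOCKET makeListener(unsigned short port) {
    SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    BOOL reuse = TRUE;                              // set BEFORE bind(), or it has no effect
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR,
               reinterpret_cast<const char*>(&reuse), sizeof(reuse));

    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);
    bind(s, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));   // error checks omitted
    listen(s, SOMAXCONN);
    return s;
}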
After seeing https://superuser.com/a/453827/56937 I discovered that there was a WerFault process that was suspended.
It must have inherited the sockets from the non-existent process because killing it freed up my listening ports.

listening socket dies unexpectedly

I'm having a problem where a TCP socket is listening on a port, and has been working perfectly for a very long time - it's handled multiple connections, and seems to work flawlessly. However, occasionally when calling accept() to create a new connection the accept() call fails, and I get the following error string from the system:
10022: An invalid argument was supplied.
Apparently this can happen when you call accept() on a socket that is no longer listening, but I have not closed the socket myself, and have not been notified of any errors on that socket.
Can anyone think of any reasons why a listening socket would stop listening, or how the error mentioned above might be generated?
Some possibilities (a small diagnostic sketch follows this list):
Some other part of your code overwrote the handle value. Check to see if it has changed (keep a copy somewhere else and compare, print it out, breakpoint on write in the debugger, whatever).
Something closed the handle.
Interactions with a buggy Winsock LSP (Layered Service Provider).
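For the first two possibilities, a diagnostic sketch: keep a copy of the listening handle taken right after listen() succeeded and, before each accept(), verify that the value is unchanged and that the socket still reports itself as listening (via the SO_ACCEPTCONN option). This is my own illustration, not code from an answer.

#include <winsock2.h>   // link with ws2_32.lib

bool listenerLooksHealthy(SOCKET listenFd, SOCKET savedCopy) {
    if (listenFd != savedCopy) {
        return false;                  // something overwrote the handle value
    }
    BOOL listening = FALSE;
    int len = sizeof(listening);
    if (getsockopt(listenFd, SOL_SOCKET, SO_ACCEPTCONN,
                   reinterpret_cast<char*>(&listening), &len) == SOCKET_ERROR) {
        return false;                  // the handle is no longer a valid socket
    }
    return listening != FALSE;         // FALSE: the socket exists but isn't listening
}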
One thing that comes to mind is system standby or hibernation mode. I'm not sure how these events are handled by the Winsock library. It might be that the network interface is (partially) shut down.
It might make sense to debug the socket's thread (either with an IDE or through a disassembler) and watch its execution for anything that might be causing it to stop listening.