I'm having a problem with one of my C++ applications on Windows 2008x64 (same app runs just fine on Windows 2003x64).
After a crash or even sometimes after a regular shutdown/restart cycle it has a problem using a socket on port 82 it needs to receive commands.
Looking at netstat I see the socket is still in listening state more than 10 minutes after the application stopped (the process is definitely not running anymore).
TCP 0.0.0.0:82 LISTENING
I tried setting the socket option to REUSEADDR but as far as I know that only affects re-connecting to a port that's in TIME_WAIT state. Either way this change didn't seem to make any difference.
int doReuse = 1;
setsockopt(listenFd, SOL_SOCKET, SO_REUSEADDR,
(const char *)&doReuse, sizeof(doReuse));
Any ideas what I can do to solve or at least avoid this problem?
EDIT:
Did netstat -an but this is all I am getting:
TCP 0.0.0.0:82 0.0.0.0:0 LISTENING
For netstat -anb I get:
TCP 0.0.0.0:82 0.0.0.0:0 LISTENING
[System]
I'm aware of shutting down gracefully, but even if the app crashes for some reason I still need to be able to restart it. The application in question uses an in-house library that internally uses Windows Sockets API.
EDIT:
Apparently there is no solution for this problem, so for development I will go with a proxy / tool to work around it. Thanks for all the suggestions, much appreciated.
If this is only hurting you at debug time, use tcpview from the sysinternals folks to force the socket closed. I am assuming it works on your platform, but I am not sure.
If you're doing blocking operations on any sockets, do not use an indefinite timeout. This can cause weird behavior on a multiprocessor machine in my experience. I'm not sure what Windows server OS it was, but, it was one or two versions previous to 2003 Server.
Instead of an indefinite timeout, use a 30 to 60 second timeout and then just repeat the wait. This goes for overlapped IO and IOCompletion ports as well, if you're using them.
If this is an app you're shipping for others to use, good luck. Windows can be a pure bastard when using sockets...
I tried setting the socket option to
REUSEADDR but as far as I know that
only affects re-connecting to a port
that's in TIME_WAIT state.
That's not quite correct. It will let you re-use a port in TIME_WAIT state for any purpose, i.e. listen or connect. But I agree it won't help with this. I'm surprised by the comment about the OS taking 10 minutes to detect the crashed listener. It should clean up all resources as soon as the process ends, other than ports in the TIME_WAIT state.
The first thing to check is that it really is your application listening on that port. Use:
netstat -anb
to figure out which process is listenin on that port.
The second thing to check is that your are closing the socket gracefully when your application shuts down. If you're using a high-level socket API that shouldn't be too much of an issue (you are using a socket API, right?).
Finally, how is your application structured? Is it threaded? Does it launch other processes? How do you know that your application is really shut down?
Run
netstat -ano
This will give you the PID of the process that has the port open. Check that process from the task manager. Make sure you have "list processes from all users" is checked.
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html is a great resource for "Bind: Address Already in Use" errors.
Some extracts:
TIME_WAIT is the state that typically ties up the port for several minutes after the process has completed. The length of the associated timeout varies on different operating systems, and may be dynamic on some operating systems, however typical values are in the range of one to four minutes.
Strategies for Avoidance
SO_REUSEADDR
This is the both the simplest and the most effective option for reducing the "address already in use" error.
Client Closes First
TIME_WAIT can be avoided if the remote end initiates the closure. So the server can avoid problems by letting the client close first.
Reduce Timeout
If (for whatever reason) neither of these options works for you, it may also be possible to shorten the timeout associated with TIME_WAIT.
After seeing https://superuser.com/a/453827/56937 I discovered that there was a WerFault process that was suspended.
It must have inherited the sockets from the non-existent process because killing it freed up my listening ports.
Related
I have a strange issue with a TCP server that sometimes hangs. The weird issue is that when it hangs it does not receive any new connection, i.e. doesn't respond to the initial TCP SYN packet. I was pretty sure that since TCP handshakes are handled by the kernel, even when a program hangs clients should still at the very least receive the initial SYN,ACK. If anyone knows a situation where a program can hang in a way that prevents the OS from even completing the TCP handshake (and without it ever closing the listening socket) please let me know.
P.S.
The program is written in C++ and the OS is Windows Server 2016.
Most likely, the listen queue is full. Not responding to the initial SYN causes the other side to try another SYN a bit later. With luck, the listen queue won't be full at that time. The program is probably not calling accept (or some similar function) often enough.
It's also possible that the program is using the selective accept functionality (see the lpfnCondition parameter to WSASelect) to choose not to respond to this connection attempt.
I'm wondering what common programming situations/bugs might cause a server process I have enter into CLOSE_WAIT but not actually close the socket.
What I'm wanting to do is trigger this situation so that I can fix it. In a normal development environment I've not been able to trigger it, but the same code used on a live server is occasionally getting them so that after many many days we have hundreds of them.
Googling for close_wait and it actually seems to be a very common problem, even in mature and supposedly well written services like nginx.
CLOSE_WAIT is basically when the remote end shut down the socket but the local application has not yet invoked a close() on it. This is usually happens when you are not expecting to read data from the socket and thus aren't watching it for readability.
Many applications for convenience sake will always monitor a socket for readability to detect a close.
A scenario to try out is this:
Peer sends 2k of data and immediately closes the data
Your socket is then registered with epoll and gets a notification for readability
Your application only reads 1k of data
You stop monitoring the socket for readability
(I'm not sure if edge-triggered epoll will end up delivering the shutdown event as a separate event).
See also:
(from man epoll_ctl)
EPOLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing half of connection. (This flag is especially useful for writing
simple code
to detect peer shutdown when using Edge Triggered monitoring.)
I use the function JNI_CreateJavaVM to create, on windows, an in-process JVM into an executable. I also pass some options to turn on debugging, in particular:
-Xrunjdwp:transport=dt_socket,server=y,address=666,suspend=n
along with some other options. This works fine.
The problem I have though is if I start two versions of the executable: JNI_CreateJavaVM crashes. The obvious fix is to check if port 666 is busy and, if so, use a different port number. Except I don't know how to do this!
So, my question: how can I check if a port is being listened on already?
Don't test for port-in-use in advance. Just let your application do what it wants to do, and catch the exception there. The situation can change between the time you test and when you actually try to use the port.
Furthermore, if you are trying to write a server application, you can have the operating system automatically pick a free port for you. From the JavaDoc for http://docs.oracle.com/javase/7/docs/api/java/net/ServerSocket.html
Creates a server socket, bound to the specified port. A port number of 0 means that the port number is automatically allocated, typically from an ephemeral port range. This port number can then be retrieved by calling getLocalPort.
The software I'm working on needs to be able to connect to many servers in a short period of time, using TCP/IP. The software runs under Win32. If a server does not respond, I want to be able to quickly continue with the next server in the list.
Sometimes when a remote server does not respond, I get a connection timeout error after roughly 20 seconds. Often the timeout comes quicker.
My problem is that these 20 seconds hurts the performance of my software, and I would like my software to give up sooner (after say 5 seconds). I assume that the TCP/IP stack (?) in Windows automatically adjusts the timeout based on some parameters?
Is it sane to override this timeout in my application, and close the socket if I'm unable to connect within X seconds?
(It's probably irrelevant, but the app is built using C++ and uses I/O completion ports for asynchronous network communication)
If you use IO completion ports and async operations, why do you need to wait for a connect to complete before continuing with the next server on the list? Use ConnectEx and pass in an overlapped structure. This way the individual server connect time will no add up, the total connect time is the max server connect time not the sum.
On Linux you can
int syncnt = 1;
int syncnt_sz = sizeof(syncnt);
setsockopt(sockfd, IPPROTO_TCP, TCP_SYNCNT, &syncnt, syncnt_sz);
to reduce (or increase) the number of SYN retries per connect per socket. Unfortunately, it's not portable to Windows.
As for your proposed solution: closing a socket while it is still in connecting state should be fine, and it's probably the easiest way. But since it sounds like you're already using asynchronous completions, can you simply try to open four connections at a time? If all four time out, at least it will only take 20 seconds instead of 80.
All configurable TCP/IP parameters for Windows are here
See TcpMaxConnectRetransmissions
You might consider trying to open many connections at once (each with its own socket), and then work with the one that responds first. The others can be closed.
You could do this with non-blocking open calls, or with blocking calls and threads. Then the lag waiting for a connection to open shouldn't be any more than is minimally nessecary.
You have to be careful when you override the socket timeout. If you are too aggressive and attempt to connect to many servers very quickly then the windows TCP/IP stack will assume your application is an internet worm and throttle it down. If this happens, then the performance of your application will become even worse.
The details of when exactly the throttling back occurs is not advertised, but the timeout you propose ( 5 seconds ) should be OK, in my experience.
The details that are available about this can be found here
I'm having a problem where a TCP socket is listening on a port, and has been working perfectly for a very long time - it's handled multiple connections, and seems to work flawlessly. However, occasionally when calling accept() to create a new connection the accept() call fails, and I get the following error string from the system:
10022: An invalid argument was supplied.
Apparently this can happen when you call accept() on a socket that is no longer listening, but I have not closed the socket myself, and have not been notified of any errors on that socket.
Can anyone think of any reasons why a listening socket would stop listening, or how the error mentioned above might be generated?
Some possibilities:
Some other part of your code overwrote the handle value. Check to see if it has changed (keep a copy somewhere else and compare, print it out, breakpoint on write in the debugger, whatever).
Something closed the handle.
Interactions with a buggy Winsock LSP.
One thing that comes to my mind is system standy or hibernation mode. I'm not sure how these events are handled by the winsock Library. Might be that the network interface is (partially) shut down.
It might make sense to debug the socket's thread (either with an IDE or through a disassembler) and watch its execution for anything that might be causing it to stop listening.