I wrote a server that listens for incoming TCP connections from clients. When I shut down the server and restart it on the same port, I sometimes get the error EADDRINUSE when calling bind(...) (error code 98 on Linux). This happens even though I am setting the option to reuse the address.
The error does not happen all the time, but it seems that it happens more often when clients are connected to the server and sending data while it shuts down. I guess the problem is that there are still pending connections while the server is shut down (related topic: https://stackoverflow.com/questions/41602/how-to-forcibly-close-a-socket-in-time-wait).
On the server side, I am using boost::asio::ip::tcp::acceptor. I initialize it with the option "reuse_address" (see http://beta.boost.org/doc/libs/1_38_0/doc/html/boost_asio/reference/basic_socket_acceptor.html). Here is the code snippet:
using boost::asio::ip::tcp;
tcp::acceptor acceptor(io_service);
tcp::endpoint ep(tcp::v4(), port);
acceptor.open(ep.protocol());
// reuse_address is set after open() and before bind()
acceptor.set_option(tcp::acceptor::reuse_address(true));
acceptor.bind(ep);
acceptor.listen();
The acceptor is closed with:
acceptor.close();
I also tried calling acceptor.cancel() before that, but it had the same effect. When this error occurs, I cannot restart the server on the same port for quite some time. Restarting the network helps, but is not a permanent solution.
What am I missing?
Any help would be greatly appreciated! :)
These were originally a comment to the question.
Does your server fork child processes? Also, are you sure the socket is in the TIME_WAIT state? You might want to grab the netstat -ap output when this happens.
When you solve these problems "by force", you are asking for trouble, aren't you?
There is a reason the default behavior requires you to wait: otherwise the network could, for example, mistake an ACK from the previous connection for an ACK belonging to the new connection.
I would not allow this "solution" to be included in release builds in my team.
Remember, when the probability of error is very low, testing is extremely difficult!
Related
I have a server application (unimrcpserver.exe) that is answering requests from client processes. This server process listens to several ports.
With the netstat -a command I get the following lines for my process:
TCP 192.168.10.65:2544 MERTB-PC:0 LISTENING
TCP 192.168.10.65:2554 MERTB-PC:0 LISTENING
TCP 192.168.10.65:9060 MERTB-PC:0 LISTENING
(The netstat output is long, so I only put the relevant lines here.)
Normally when the system works I make requests to the server from these ports and each of them works fine.
While doing stress tests, I hit a situation where the system no longer responded to the requests I made through port 2554.
netstat -a still gives me the above lines, so the server is apparently still listening on this port. When I run telnet on the same machine, it gives an error:
telnet 192.168.10.65 2554
Connecting To 192.168.10.65...Could not open connection to the host, on port 2554: Connect failed
I also wrote a simple program with c++ to get the exact error message that the system generates to a connect() request. This time I get the following error:
No connection could be made because the target machine actively refused it
Additional info: Everything is on the same Windows machine. The firewall is disabled. This situation occurred only once, while I was running stress tests that send multiple requests at the same time. Before it occurred, the system had handled around 13000 requests, which took around half an hour.
So the question is : How can this situation occur? The port is being reported as "LISTENING" with netstat but I cannot connect to it. If it can be caused by a programming error what kind of an error can cause this kind of behavior?
A new connection can be "actively refused" under several conditions:
1. There is no LISTENING socket on the IP:Port being connected to.
2. There is a LISTENING socket, but its backlog of pending connections is full, so it cannot accept a new connection at that moment.
3. A firewall is blocking it. Though the firewall is more likely to use a different error, if it sends an error at all.
Since there is a LISTENING socket, #2 is the most likely/common case. If so, it means the server app is not accepting clients from its backlog fast enough, if at all.
A client cannot differentiate between these conditions. All it can do is detect the connect failure - WSAECONNREFUSED or ECONNREFUSED, depending on platform - and try again later.
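The "try again later" part can be sketched as follows. This is only a minimal illustration, assuming POSIX sockets on Linux (on Winsock the error would be WSAECONNREFUSED from WSAGetLastError()). The function name connect_with_retry and its parameters are made up for this example.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: try to connect to ip:port, retrying only when the
// connection is refused. Returns a connected descriptor, or -1 when the
// attempts are exhausted or a different error occurs (errno holds the error).
int connect_with_retry(const char* ip, std::uint16_t port, int max_attempts,
                       std::chrono::milliseconds delay) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        errno = EINVAL;  // not a valid dotted-quad IPv4 address
        return -1;
    }

    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) == 0) {
            return fd;  // connected
        }

        int saved = errno;
        close(fd);  // use a fresh socket for every attempt
        if (saved != ECONNREFUSED) {
            errno = saved;  // some other failure: do not retry
            return -1;
        }
        if (attempt < max_attempts) std::this_thread::sleep_for(delay);
    }
    errno = ECONNREFUSED;
    return -1;
}
```

A caller might use something like connect_with_retry("192.168.10.65", 2554, 5, std::chrono::milliseconds(500)). The client still cannot tell which of the conditions caused the refusal; it only spaces out its attempts.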
> So the question is: How can this situation occur? The port is being reported as "LISTENING" with netstat but I cannot connect to it. If it can be caused by a programming error what kind of an error can cause this kind of behavior?
Yes, it could be caused by a programming error on the server. I have seen it happen when the server's listening thread is deadlocked. The socket's state is still "listening", but if the listening thread shares some global state and is blocked waiting for another thread to release a mutex, it never gets to accept the connection and you will encounter this.
Also, as others here have stated, a CPU loaded by your stress test might cause the server to refuse connections: the threads may be busy processing, so the listening thread never gets a chance to accept the connection.
I have a situation where the server that the client connects to may get shut down repeatedly while the client is still operational.
In the current implementation, when a read fails, the client calls close(sockFd) to close the socket. It then loops, trying to recreate that socket.
Is that best practice? Or is it possible to leave the socket and attempt to connect to it?
Edit: Platform is Linux
When you get any error other than EINTR or EAGAIN/EWOULDBLOCK on a socket it is almost certainly dead and must be closed. #abarnert gives some others in the useful comment below.
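As a rough sketch of that rule, assuming Linux and a stream socket: the names is_retryable_socket_error, ReadResult and read_some are invented for this illustration, and the list of retryable errors is just the one given above.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>

// True when a failed recv()/send() can simply be retried on the same socket.
bool is_retryable_socket_error(int err) {
    return err == EINTR || err == EAGAIN || err == EWOULDBLOCK;
}

enum class ReadResult { Data, TryAgain, Closed };

// Hypothetical helper: read once from fd. On end-of-stream or on any
// non-retryable error the socket is closed and fd is set to -1, so the
// caller knows it has to create a new socket and reconnect.
ReadResult read_some(int& fd, char* buf, std::size_t cap, ssize_t& n) {
    n = recv(fd, buf, cap, 0);
    if (n > 0) return ReadResult::Data;
    if (n < 0 && is_retryable_socket_error(errno)) return ReadResult::TryAgain;
    close(fd);  // peer closed the connection (n == 0) or the socket is dead
    fd = -1;
    return ReadResult::Closed;
}
```

After ReadResult::Closed the old descriptor is gone; reconnecting means calling socket() and connect() again rather than reusing it.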
Ok let me be clear. I'm using TCP and that should mean a connection shouldn't interrupt unless closed or due to network problems.
So here's my issue:
Utilizing my sockets works perfectly.
After 5 - 10 min of inactivity they stop responding (the connection is still alive [checked with netstat -n]).
It tells me that data is sent (but the other side doesn't receive it, and I'm sure it is waiting for it).
If I keep sending, eventually it will give me WSA error 10038 (invalid socket handle).
EDIT: After a few more tries at sending, it gave me error 10058 (An established connection was aborted by the software in your host machine.)
I'm confused completely. I haven't closed the socket nor done anything to it other than inactivity. If I use it nonstop for 10 - 20 minutes, it works perfectly.
With error 10058, it's practically certain that a gateway (a proxy, or a firewall, or a router, with or without NAT) is timing out its relay of your connection.
Basically, you are not directly connected with your peer. Instead, the gateway sits in between and explicitly transfers data between its connection with you and its connection with your peer. Since sockets are a limited resource, the gateway has an eviction policy where it shuts down what look like inactive connections. If you look dead, boom, you are dead.
Your only option is to remain active, which typically means working in some kind of "heartbeat" into your application protocol. Nasty, but them's the breaks.
Unless you really know what you are doing, do not play around with TCP's SO_KEEPALIVE.
A NAT firewall may be eating your connection without telling you. Try enabling TCP keepalive.
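For reference, enabling TCP keepalive could look roughly like this. The sketch assumes Linux, where the per-socket tuning options are TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; Winsock configures the timing differently. The helper name and the values a caller passes are illustrative only, and the caution above about SO_KEEPALIVE still applies.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical helper (Linux): turn on keepalive probes for a TCP socket.
//   idle_secs     - idle time before the first probe is sent
//   interval_secs - time between unanswered probes
//   probe_count   - unanswered probes before the connection is dropped
// Returns true if every option was set.
bool enable_tcp_keepalive(int fd, int idle_secs, int interval_secs, int probe_count) {
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0) return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof idle_secs) != 0) return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof interval_secs) != 0) return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count, sizeof probe_count) != 0) return false;
    return true;
}
```

To keep a gateway from evicting the connection, the idle time would have to be shorter than the gateway's inactivity timeout, which in the situation described seems to be somewhere around 5 to 10 minutes.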
I am developing a client-server application on Windows using C++ and the Winsock library. It works fine, but if the server has started listening on the network and I then unplug the network cable, the server does not report any error in any thread. How can the server socket find out that the network cable has been unplugged?
If anybody knows, please help me.
While it should be possible to detect that the network cable is unplugged on the host, you will still have the same problem if the network is disrupted somewhere else between your server and the clients.
One common (if not the most common) way to solve this is to have a "keep-alive" message being sent. If no reply to that message is received within some timeout you simply close the connection and release all resources associated with it.
Edit
A "keep-alive" message is like using the "ping" command to see if a remote machine can be reached. It is simply a message that is sent, either by the server or the client (it doesn't matter who initiate it) to see if the other end of the connection is alive and can be reached.
It can be as simple as sending the string "Are you there?" and expecting a reply containing "Yes I am". If you send it once every minute, and don't get a reply within (for example) one minute, you can consider the connection dead. The other end, which receives the "Are you there?", knows it will get the message once every minute. If it hasn't arrived for two minutes, then the sender is no longer reachable.
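The bookkeeping behind such a keep-alive message can be sketched like this. It is only an illustration of the timing logic: the class name HeartbeatMonitor and its methods are made up, and actually sending and receiving the messages is left to the application.

```cpp
#include <chrono>

// Hypothetical timing helper for an application-level keep-alive.
// The caller passes in the current time so the logic stays independent
// of how the messages are actually sent and received.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatMonitor(Clock::duration interval, Clock::duration timeout,
                     Clock::time_point now)
        : interval_(interval), timeout_(timeout), last_sent_(now), last_heard_(now) {}

    // Is it time to send the next "Are you there?"
    bool should_send(Clock::time_point now) const { return now - last_sent_ >= interval_; }

    // Call after a keep-alive message has been sent.
    void on_sent(Clock::time_point now) { last_sent_ = now; }

    // Call whenever anything is received from the peer.
    void on_heard(Clock::time_point now) { last_heard_ = now; }

    // True once nothing has been heard for longer than the timeout.
    bool peer_is_dead(Clock::time_point now) const { return now - last_heard_ > timeout_; }

private:
    Clock::duration interval_;
    Clock::duration timeout_;
    Clock::time_point last_sent_;
    Clock::time_point last_heard_;
};
```

With the numbers from the example, a receiver would construct it with an interval of one minute and a timeout of two minutes, and close the connection once peer_is_dead() returns true.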
If the protocol can't be modified to add such messages, then see if some other message can be used instead.
Also, remember that the best, and in some cases the only, way to know if something is wrong with a connection is to attempt to read from the socket.
You can unplug a network and then plug it back in, or your Wi-Fi laptop can lose reception for a second and then pick it back up. It would be frustrating if such resumable cases were treated as an error in all the programs we use.
From this Winsock "newbie" FAQ:
The previous question deals with detecting when a protocol connection is dropped normally, but what if you want to detect other problems, like unplugged network cables or crashed workstations? In these cases, the failure prevents notifying the remote peer that something is wrong. My feeling is that this is usually a feature, because the broken component might get fixed before anyone notices, so why demand that the connection be reestablished?
If you feel you have a "special needs" situation you can be aggressive with timeouts. But I wouldn't do that unless there was a really good reason.
System Background:
It's basically a client/server application. The server is an embedded device and the client is a Windows app developed in C++.
Issue: After a runtime of about a week, communication between client and server breaks. Because of this, the server is not able to connect back to the client and needs a restart to recover. It looks like the system is experiencing a socket reconnection problem. The network also sometimes experiences intermittent failures.
Abrupt Termination at remote end
Port locking
Want some suggestions on how to cleanup the socket or shutdown cleanly so that re-connection happens properly. Other alternate solutions?
Thanks,
Hussain
It does not sound like you are in a position to easily write a stress test app to reproduce this more quickly out of band, which is what I would normally suggest. A pragmatic solution might be to periodically restart the server and client at a time when you think the system is least busy, or when problems arise. This sounds like cheating but many production systems I have been involved with take this approach to maximize system uptime.
My preferred solution here would be to abstract the server and client socket code (hopefully your design allows this to be done without too much work) and use it to implement client and server test apps that can be used to stress test only the socket code by simulating a lot of normal socket traffic in a short space of time - this helps identify timing windows and edge cases that could cause problems over time, and might speed up the process of obtaining a debuggable repro - you can simulate network error in your test code by dropping the socket on the client or server periodically.
A further step to take on the strategic front would be to ensure that you have good diagnostics in your socket handlers on both the client and the server side. Track socket open and close, with special focus on your socket error and reconnect paths, given you know the network is unreliable. Make sure the log entries are written sequentially, each with a timestamp. Something as simple as this might quickly show you what errors or conditions trigger your problems. You can quickly make sure the logs are correct and complete using the test apps I mentioned above.
One thing you might want to check is that you are not being hit by an inability to reuse addresses. Sometimes when a socket gets closed, it cannot be immediately reused for a reconnect attempt, as there is still residual activity on one end or the other. You may be able to get around this (based on my Windows/Winsock experience) by experimenting with SO_REUSEADDR and SO_LINGER on your sockets. However, my first focus in your case would be on ensuring that the socket code on client and server handles all errors and mainline cases correctly, before worrying about this.
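For experimenting with SO_LINGER, a small wrapper might look like the following. It is written against POSIX sockets as an assumption (Winsock has the same option, but the call details differ slightly), and the name set_linger is invented for this sketch.

```cpp
#include <sys/socket.h>

// Hypothetical helper: control what close() does with unsent data.
//   enabled = false            -> default behavior, close() returns at once
//   enabled = true, secs > 0   -> close() may block up to secs while data drains
//   enabled = true, secs == 0  -> abortive close: unsent data is discarded and
//                                 the connection is reset instead of going
//                                 through the normal shutdown sequence
bool set_linger(int fd, bool enabled, int secs) {
    linger lg{};
    lg.l_onoff = enabled ? 1 : 0;
    lg.l_linger = secs;
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) == 0;
}
```

The abortive variant throws away data the peer may still be expecting, so it is better treated as a diagnostic experiment than as a default.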
A common issue is that when a connection is dropped, the OS keeps it open in the TIME_WAIT state. If you want to restart the server socket, it will not be able to reopen the same port right away, because as far as the OS is concerned the port is still in use.
To avoid that, you need to set the parameter SO_REUSEADDR so that the OS allows you to reuse the port if it is in TIME_WAIT state for a server socket.
Example:
int optval=1;
// set SO_REUSEADDR on a socket to true (1):
setsockopt(s1, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof optval);
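Put into context, the option has to be set on the new socket before bind() is called. Here is a minimal sketch of a complete listener setup, assuming POSIX sockets and IPv4; the function name open_listener and its parameters are only for illustration.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Hypothetical helper: create a TCP socket listening on all interfaces.
// Returns the descriptor, or -1 on failure.
int open_listener(std::uint16_t port, int backlog) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // Set SO_REUSEADDR before bind(); setting it afterwards does not help
    // the bind() call that has already happened.
    int optval = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof optval) != 0) {
        close(fd);
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);  // port 0 lets the OS pick a free port

    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) != 0 ||
        listen(fd, backlog) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

If bind() still fails with EADDRINUSE despite this, it is worth checking whether another socket is actually still listening on the port, for example one inherited by a child process.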
I'm experiencing something similar with encrypted connections. I believe in my case it is because the client dropped the connection and reconnected in less than the 4 minute FIN_WAIT period. The initial connection is recycled (by the OS) and the server doesn't see the dropout. The SSL authentication is lost when the client loses the connection, so the client tries to re-authenticate. This happens during what the server considers the middle of a conversation. The server then hangs up on the client. I think the server's SSL code considers this a man-in-the-middle attack, or just gets confused, and closes the connection.