socket not released after service restart - c++

There are two components:
A: a program that holds an open socket
B: a watchdog script running as a service:
while true
do
    if [ -z "`pidofproc $1`" ]; then
        $1;
        chrt -f -p 40 `pidofproc $1`
        sleep 8
    fi;
    sleep 2
done
When the service is started, the watchdog is started.
When the service is stopped, the watchdog and the program are killed (killall).
Now the program wants to upgrade itself, so it calls system( "upgrade.sh" );
upgrade.sh:
/sbin/service watchdog stop
.... install upgrade .....
exec /sbin/service watchdog start &
The upgrade is performed successfully, but when the program starts it can't open the socket (already in use). On this error the program quits (to be restarted by the watchdog).
lsof -i shows three programs on the port:
watchdog
program
sleep
The program and sleep PIDs keep changing (i.e. quit/restart behavior).
The watchdog PID is persistent.
I tried replacing system(...) with
if(!fork()) exec(...), but the same problem remains.

Depending on how quickly the restart follows the shutdown, the socket may still be lingering. By default, Linux keeps a socket's address marked as in use for some time after the socket has been released (either by close() or when the process dies), to make sure that incoming connection attempts or data delayed by network latency don't end up at the wrong application.
This has to be fixed inside the application by setting the SO_REUSEADDR socket option. Per the socket(7) man page:
Indicates that the rules used in validating addresses supplied
in a bind(2) call should allow reuse of local addresses. For
AF_INET sockets this means that a socket may bind, except when
there is an active listening socket bound to the address. When
the listening socket is bound to INADDR_ANY with a specific port
then it is not possible to bind to this port for any local
address. Argument is an integer boolean flag.
This has to be set with setsockopt() after the socket is created and before bind() is called.
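To make the order of calls concrete, here is a minimal sketch for a Linux/POSIX system. The helper name open_listener is hypothetical and not taken from the question; the point is only that setsockopt() with SO_REUSEADDR comes between socket() and bind().

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: returns a listening TCP socket, or -1 on failure.
int open_listener(std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { std::perror("socket"); return -1; }

    // Must happen after socket() and before bind().
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        std::perror("setsockopt(SO_REUSEADDR)");
        close(fd);
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        std::perror("bind/listen");
        close(fd);
        return -1;
    }
    return fd;
}
```

Passing port 0 lets the kernel pick a free port, which is handy for trying the helper out; a real server would pass its fixed port.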

Related

How can I check whether a port is in TIME_WAIT status or already taken by another application

I developed a service that contains a TCP server.
When I restart my application, it is sometimes unable to bind the port because of TIME_WAIT.
I want to add a procedure to my application for when the bind fails. This procedure should first check why the bind failed:
If the failure is due to TIME_WAIT, wait a moment and retry.
If the port is taken by another application, choose another port and retry the bind.
How can I tell which kind of bind failure occurred?
NOTE:
I do not want to use SO_REUSEADDR.
The bind errno is the same for both kinds of failure.
Firstly, I would use the SO_REUSEADDR socket option on your listening socket to avoid this situation in the first place. SO_REUSEADDR lets your TCP server bind to the same address and port again on restart, even though the operating system is still holding connections from the previous run in TIME_WAIT.
Secondly, error handling is always a good idea. I encourage you to check the return value of bind() and handle the errno values you most expect to encounter. You can find the list of errno values for bind in section 2 of the man pages.
$> man -s 2 bind
Finally, the errno for a socket in TIME_WAIT is the same errno as an address in use: EADDRINUSE. This is because a socket in TIME_WAIT is in use by the OS.
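Since errno cannot tell the two cases apart, one possible approach is a bounded retry: keep retrying on EADDRINUSE for a limited number of attempts, and only then fall back to another port. The sketch below assumes a POSIX system; bind_with_retry and its parameters are made-up names for illustration, not an existing API.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: try to bind fd to the given port, retrying only
// when bind() fails with EADDRINUSE. Returns 0 on success, otherwise the
// errno of the last failed bind().
int bind_with_retry(int fd, std::uint16_t port, int attempts,
                    std::chrono::milliseconds delay) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    int last_error = EINVAL;  // reported if attempts <= 0
    for (int i = 0; i < attempts; ++i) {
        if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0)
            return 0;
        last_error = errno;
        if (last_error != EADDRINUSE) break;  // e.g. EACCES: waiting won't help
        if (i + 1 < attempts) std::this_thread::sleep_for(delay);
    }
    return last_error;
}
```

If the function still returns EADDRINUSE after the retry budget is used up, the caller can treat the port as taken and pick another one. How long to keep retrying is a judgment call, since TIME_WAIT durations differ between operating systems.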

Program doesn't stop when "ctrl c" is used to terminate program

I have a program (compiled C++) that uses a TCP socket to communicate. The program can be run in two modes, say mode A and mode B.
When I start the program in mode A, it prints something like:
waiting connections on port 1234
local endpoint : 0.0.0.0:1234
//I think it is using boost for TCP socket
Then I start mode B. The two find each other and run perfectly.
The problem is that if I start mode A and then use "ctrl c" to terminate it, the port is left open.
When I then start mode B, it still finds the connection and runs into errors because A is no longer there.
I use a bash script to run the application. How can I force that port to close? (In bash or any other possible way.)
Thanks
Use this in the bash script (before calling your binary):
trap "fuser -k -n tcp 1234 && exit" SIGINT SIGTERM
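If you can change the program itself, another option is to handle the signal in the C++ code and close the listening socket before exiting. This is only a sketch for a POSIX system; stop_requested, install_stop_handlers and serve are hypothetical names, and the accept loop stands in for whatever the real program does.

```cpp
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>

// Set from the signal handler; only async-signal-safe work happens there.
volatile sig_atomic_t stop_requested = 0;

extern "C" void on_terminate(int) { stop_requested = 1; }

// Hypothetical helper: install the handler for SIGINT (Ctrl-C) and SIGTERM.
bool install_stop_handlers() {
    struct sigaction sa{};
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  // no SA_RESTART, so a blocking accept() fails with EINTR
    return sigaction(SIGINT, &sa, nullptr) == 0 &&
           sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Accept loop sketch: leaves the loop after a signal and closes the listener.
void serve(int listen_fd) {
    while (!stop_requested) {
        int client = accept(listen_fd, nullptr, nullptr);
        if (client < 0) {
            if (errno == EINTR) continue;  // interrupted: re-check the flag
            break;                         // any other error ends the loop
        }
        close(client);  // a real server would handle the client here
    }
    close(listen_fd);
}
```

The program would call install_stop_handlers() once at startup and then run serve() on its listening socket.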

Program works in foreground, but not in background using nohup

I am currently writing a server backend for my iOS game. The server is written in C++ and compiled and run on a remote Ubuntu server. I start the server through SSH using
sudo nohup ./mygameserver &
The server communicates over TCP. The main function with the read loop is written using the standard C socket.h/netdb.h/in.h headers, and uses select() to accept many users on a nonblocking listening socket.
When I run the server in the foreground through SSH, everything seems to be working fine. It receives all packets I send in the right order, and with correct header info. When I use nohup and disconnect from SSH however... Everything seems to crash. A typical log from a user connect when the server software runs without SSH/in nohup mode looks like this:
CONNECTED TO SQL
Server starting...
Second thread initialized
Entering order loop
Got a new message of type: 0
Got a new message of type: 0
Got a new message of type: 0
Got a new message of type: 0
<this line continues ad infinitum>
I really have no idea why. I've made sure every print goes to nohup.out instead of std::cout, and the server sends an update to MySQL every 10 seconds to avoid timeouts.
Any ideas or input on what's wrong here? Let me know if you want some code samples, I just don't know which ones are interesting to this problem in particular. Thanks in advance.
I found out what was wrong.
In my server program I have a readSockets() function which is called after the call to select() in the main server loop. readSockets() responded to a newline character that nohup pushed to stdin on startup (for reasons I don't know). Since stdin is, in fact, also a FILE* backed by a file descriptor, readSockets() treated stdin as a connecting client.
This obviously made the server crash, as stdin was never drained (and was therefore reported as readable every time select() returned). This in turn blocked the thread for other users.
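For anyone hitting the same thing, one way to rule it out is to build the read set explicitly from the listening socket and the known client descriptors, so that descriptor 0 (stdin) is never watched unless it is added on purpose. A minimal sketch, where build_read_set is a hypothetical helper and not code from my server:

```cpp
#include <sys/select.h>
#include <vector>

// Hypothetical helper: fill `readfds` with the listening socket and the
// connected clients only. Returns the highest descriptor that was added,
// which is what select() needs (plus one) as its first argument.
int build_read_set(int listen_fd, const std::vector<int>& client_fds,
                   fd_set& readfds) {
    FD_ZERO(&readfds);              // start empty: stdin (fd 0) is not included
    FD_SET(listen_fd, &readfds);
    int max_fd = listen_fd;
    for (int fd : client_fds) {
        FD_SET(fd, &readfds);
        if (fd > max_fd) max_fd = fd;
    }
    return max_fd;
}
```

The main loop would then call select(max_fd + 1, &readfds, nullptr, nullptr, nullptr) and check only those descriptors with FD_ISSET afterwards.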

Socket still listening after application crash

I'm having a problem with one of my C++ applications on Windows 2008x64 (same app runs just fine on Windows 2003x64).
After a crash or even sometimes after a regular shutdown/restart cycle it has a problem using a socket on port 82 it needs to receive commands.
Looking at netstat I see the socket is still in listening state more than 10 minutes after the application stopped (the process is definitely not running anymore).
TCP 0.0.0.0:82 LISTENING
I tried setting the socket option to REUSEADDR but as far as I know that only affects re-connecting to a port that's in TIME_WAIT state. Either way this change didn't seem to make any difference.
int doReuse = 1;
setsockopt(listenFd, SOL_SOCKET, SO_REUSEADDR,
(const char *)&doReuse, sizeof(doReuse));
Any ideas what I can do to solve or at least avoid this problem?
EDIT:
Did netstat -an but this is all I am getting:
TCP 0.0.0.0:82 0.0.0.0:0 LISTENING
For netstat -anb I get:
TCP 0.0.0.0:82 0.0.0.0:0 LISTENING
[System]
I'm aware of shutting down gracefully, but even if the app crashes for some reason I still need to be able to restart it. The application in question uses an in-house library that internally uses Windows Sockets API.
EDIT:
Apparently there is no solution for this problem, so for development I will go with a proxy / tool to work around it. Thanks for all the suggestions, much appreciated.
If this is only hurting you at debug time, use tcpview from the sysinternals folks to force the socket closed. I am assuming it works on your platform, but I am not sure.
If you're doing blocking operations on any sockets, do not use an indefinite timeout. This can cause weird behavior on a multiprocessor machine in my experience. I'm not sure what Windows server OS it was, but, it was one or two versions previous to 2003 Server.
Instead of an indefinite timeout, use a 30 to 60 second timeout and then just repeat the wait. This goes for overlapped IO and IOCompletion ports as well, if you're using them.
If this is an app you're shipping for others to use, good luck. Windows can be a pure bastard when using sockets...
I tried setting the socket option to
REUSEADDR but as far as I know that
only affects re-connecting to a port
that's in TIME_WAIT state.
That's not quite correct. It will let you re-use a port in TIME_WAIT state for any purpose, i.e. listen or connect. But I agree it won't help with this. I'm surprised by the comment about the OS taking 10 minutes to detect the crashed listener. It should clean up all resources as soon as the process ends, other than ports in the TIME_WAIT state.
The first thing to check is that it really is your application listening on that port. Use:
netstat -anb
to figure out which process is listening on that port.
The second thing to check is that you are closing the socket gracefully when your application shuts down. If you're using a high-level socket API that shouldn't be too much of an issue (you are using a socket API, right?).
Finally, how is your application structured? Is it threaded? Does it launch other processes? How do you know that your application is really shut down?
Run
netstat -ano
This will give you the PID of the process that has the port open. Check that process in the task manager. Make sure "list processes from all users" is checked.
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html is a great resource for "Bind: Address Already in Use" errors.
Some extracts:
TIME_WAIT is the state that typically ties up the port for several minutes after the process has completed. The length of the associated timeout varies on different operating systems, and may be dynamic on some operating systems, however typical values are in the range of one to four minutes.
Strategies for Avoidance
SO_REUSEADDR
This is both the simplest and the most effective option for reducing the "address already in use" error.
Client Closes First
TIME_WAIT can be avoided if the remote end initiates the closure. So the server can avoid problems by letting the client close first.
Reduce Timeout
If (for whatever reason) neither of these options works for you, it may also be possible to shorten the timeout associated with TIME_WAIT.
After seeing https://superuser.com/a/453827/56937 I discovered that there was a WerFault process that was suspended.
It must have inherited the sockets from the non-existent process because killing it freed up my listening ports.

socket program is able to connect to the port which is still in TIME_WAIT

I have written a very simple socket server.
It listens on port 63254.
First I call socket_create, socket_bind and socket_listen, so at this point a socket is listening.
Then in a loop I call socket_accept, which gives me another socket for the connection.
The read loop reads until I input exit.
After that, the resource returned by socket_accept is closed,
and then the main socket is closed.
When I checked in TCPView after closing all connections, I could still see the System process showing TIME_WAIT for port 63254.
If I run the socket server program again, it connects; when the whole run is over, all connections are closed and the program terminates, and now I can see another TIME_WAIT for the same port. Yet I could still connect to the same port a third time.
In a Stack Overflow answer it is said that a connection cannot be made to a port that is in the wait state.
I opened the Firefox browser and it opened 4 connections.
When I closed it, they all closed and the System process showed 4 TIME_WAITs for 2 minutes.
Every TIME_WAIT stays for 2 minutes and then disappears.
So what I conclude is that a TIME_WAIT occurs for every connection close and cannot be avoided.
I read many posts on Stack Overflow but still wasn't sure about it.
I run the following code from the command line.
My server Code
<?php
error_reporting(E_ALL);
set_time_limit(0);
ob_implicit_flush();
$str = '';
$buff = '';
// Create the listening socket and bind it to the loopback address.
$s = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
if (!$s) die('Unable to create socket');
if (!socket_bind($s, '127.0.0.1', 63254))
    die("\nTrying to Bind: " . socket_strerror(socket_last_error()));
if (!socket_listen($s, 1))
    die(socket_strerror(socket_last_error()));
while (1)
{
    // Block until a client connects.
    $acc = socket_accept($s);
    if (!$acc) die(socket_strerror(socket_last_error()));
    // echo "\n".gettype($acc);
    while (1)
    {
        $str = socket_read($acc, 512);
        // Check for a read error before using the data.
        if ($str === false) die(socket_strerror(socket_last_error()));
        // An empty string means the client has closed the connection.
        if ($str === '') break;
        $buff .= $str;
        echo $str;
        // echo '::'.gettype($str);
        if ($str == "exit\r\n") break;
    }
    // if(!socket_shutdown($acc,2))echo socket_strerror(socket_last_error());
    socket_close($acc);
    // Stop the server once any client has sent "exit".
    if (preg_match('/exit/', $buff)) break;
}
//echo "\nConnection closed by server\n";
//if(!socket_shutdown($s,2))echo socket_strerror(socket_last_error());
socket_close($s);
?>
The client code
<?php
set_time_limit(0);
$f = fsockopen('127.0.0.1', 63254, $a, $b, 10);
if (!$f) die('cannot connect');
echo "\nConnected: \n";
do {
    $buff = fgets(STDIN);
    fwrite($f, $buff);
} while ($buff != "exit\r\n");
fclose($f);
?>
I'd welcome suggestions for building a better client/server if this is not sufficient. This code is just child's play; I'm only trying to understand how the communication works.
In a Stack Overflow answer it is
said that a connection cannot be made
to a port that is in the wait state.
I don't know what answer you're referring to, but you cannot bind to a port which is in TIME_WAIT state. If you are a server you can set the SO_REUSEADDR option to overcome this (setReuseAddress() in Java, socket_set_option() in PHP). If you're a client you have to wait, or use a different outbound port, or best of all don't specify an outbound port at all and let the system find one. You are a server so this doesn't apply to you.
I opened the Firefox browser and it
opened 4 connections. When I closed it,
they all closed and the System process
showed 4 TIME_WAITs for 2 minutes.
Every TIME_WAIT stays for 2 minutes
and then disappears.
But those are client ports. Outbound ports. At your server they were inbound ports, and there was also a listening port on the same port number. As long as there is a listening port, an inbound connection can succeed.
So what I conclude is that a
TIME_WAIT occurs for every connection
close and cannot be avoided.
TIME_WAIT occurs when you are the end that sends the close first. If you are the end that received the close, and closed in response, your port doesn't go into TIME_WAIT at all.