TCP Connection Dropped [duplicate] - c++

I am using a loop to read message out from a c Berkeley socket but I am not able to detect when the socket is disconnected so I would accept a new connection. please help
while(true) {
bzero(buffer,256);
n = read(newsockfd,buffer,255);
printf("%s\n",buffer);
}

The only way you can detect that a socket is connected is by writing to it.
Getting a error on read()/recv() will indicate that the connection is broken, but not getting an error when reading doesn't mean that the connection is up.
You may be interested in reading this:
http://lkml.indiana.edu/hypermail/linux/kernel/0106.1/1154.html
In addition, using TCP Keep Alive may help distinguish between inactive and broken connections (by sending something at regular intervals even if there's no data to be sent by the application).
(EDIT: Removed incorrect sentence as pointed out by #Damon, thanks.)

Your problem is that you are completely ignoring the result returned by read(). Your code after read() should look at least like this:
if (n == 0) // peer disconnected
break;
else if (n == -1) // error
{
perror("read");
break;
}
else // received 'n' bytes
{
printf("%.*s", n, buffer);
}
And accepting a new connection should be done in a separate thread, not dependent on end of stream on this connection.
The bzero() call is pointless, just a workaround for prior errors.

That's because you didn't use keepalive timeout.
In receiving side, keepalive socket option is the best solution for detecting dead connection.
But, in case of your application continue to write to socket, there is something to think more.
Even though you already set keepalive option to your application socket, you can't detect in time the dead connection state of the socket, in case of your app keeps writing on the socket.
That's because of tcp retransmission by the kernel tcp stack.
tcp_retries1 and tcp_retries2 are kernel parameters for configuring tcp retransmission timeout.
It's hard to predict precise time of retransmission timeout because it's calculated by RTT mechanism.
You can see this computation in rfc793. (3.7. Data Communication)
https://www.rfc-editor.org/rfc/rfc793.txt
Each platforms have kernel configurations for tcp retransmission.
Linux : tcp_retries1, tcp_retries2 : (exist in /proc/sys/net/ipv4)
http://linux.die.net/man/7/tcp
HPUX : tcp_ip_notify_interval, tcp_ip_abort_interval
http://www.hpuxtips.es/?q=node/53
AIX : rto_low, rto_high, rto_length, rto_limit
http://www-903.ibm.com/kr/event/download/200804_324_swma/socket.pdf
You should set lower value for tcp_retries2 (default 15) if you want to early detect dead connection, but it's not precise time as I already said.
In addition, currently you can't set those values only for single socket. Those are global kernel parameters.
There was some trial to apply tcp retransmission socket option for single socket(http://patchwork.ozlabs.org/patch/55236/), but I don't think it was applied into kernel mainline. I can't find those options definition in system header files.
For reference, you can monitor your keepalive socket option through 'netstat --timers' like below.
https://stackoverflow.com/questions/34914278
netstat -c --timer | grep "192.0.0.1:43245 192.0.68.1:49742"
tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (1.92/0/0)
tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (0.71/0/0)
tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (9.46/0/1)
tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (8.30/0/1)
tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (7.14/0/1)
tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (5.98/0/1)
tcp 0 0 192.0.0.1:43245 192.0.68.1:49742 ESTABLISHED keepalive (4.82/0/1)
In addition, when keepalive timeout ocurrs, you can meet different return events depending on platforms you use, so you must not decide dead connection status only by return events.
For example, HP returns POLLERR event and AIX returns just POLLIN event when keepalive timeout occurs.
You will meet ETIMEDOUT error in recv() call at that time.
In recent kernel version(since 2.6.37), you can use TCP_USER_TIMEOUT option will work well. This option can be used for single socket.

Related

Socket is open after process, that opened it finished

After closing client socket on sever side and exit application, socket still open for some time.
I can see it via netstat
Every 0.1s: netstat -tuplna | grep 6676
tcp 0 0 127.0.0.1:6676 127.0.0.1:36065 TIME_WAIT -
I use log4cxx logging and telnet appender. log4cxx use apr sockets.
Socket::close() method looks like that:
void Socket::close() {
if (socket != 0) {
apr_status_t status = apr_socket_close(socket);
if (status != APR_SUCCESS) {
throw SocketException(status);
}
socket = 0;
}
}
And it's successfully processed. But after program is finished I can see opened socket via netstat, and if it starts again log4cxx unable to open 6676 port, because it is busy.
I tries to modify log4cxx.
Shutdown socket before close:
void Socket::close() {
if (socket != 0) {
apr_status_t shutdown_status = apr_socket_shutdown(socket, APR_SHUTDOWN_READWRITE);
printf("Socket::close shutdown_status %d\n", shutdown_status);
if (shutdown_status != APR_SUCCESS) {
printf("Socket::close WTF %d\n", shutdown_status != APR_SUCCESS);
throw SocketException(shutdown_status);
}
apr_status_t close_status = apr_socket_close(socket);
printf("Socket::close close_status %d\n", close_status);
if (close_status != APR_SUCCESS) {
printf("Socket::close WTF %d\n", close_status != APR_SUCCESS);
throw SocketException(close_status);
}
socket = 0;
}
}
But it didn't helped, bug still reproduced.
This is not a bug. Time Wait (and Close Wait) is by design for safety purpose. You may however adjust the wait time. In any case, on server's perspective the socket is closed and you are relax by the ulimit counter, it has not much visible impact unless you are doing stress test.
As noted by Calvin this isn't a bug, it's a feature. Time Wait is a socket state that says, this socket isn't in use any more but nevertheless can't be reused quite yet.
Imagine you have a socket open and some client is sending data. The data may be backed up in the network or be in-flight when the server closes its socket.
Now imagine you start the service again or start some new service. The packets on the wire aren't aware that its a new service and the service can't know the packets were destined for a service that's gone. The new service may try to parse the packets and fail because they're in some odd format or the client may get an unrelated error back and keep trying to send, maybe because the sequence numbers don't match and the receiving host will get some odd error. With timed wait the client will get notified that the socket is closed and the server won't potentially get odd data. A win-win. The time it waits should be sofficient for all in-transit data to be flused from the system.
Take a look at this post for some additional info: Socket options SO_REUSEADDR and SO_REUSEPORT, how do they differ? Do they mean the same across all major operating systems?
TIME_WAIT is a socket state to allow all in travel packets that could remain from the connection to arrive or dead before the connection parameters (source address, source port, desintation address, destination port) can be reused again. The kernel simply sets a timer to wait for this time to elapse, before allowing you to reuse that socket again. But you cannot shorten it (even if you can, you had better not to do it), because you have no possibility to know if there are still packets travelling or to accelerate or kill them. The only possibility you have is to wait for a socket bound to that port to timeout and pass from the state TIME_WAIT to the CLOSED state.
If you were allowed to reuse the connection (I think there's an option or something can be done in the linux kernel) and you receive an old connection packet, you can get a connection reset due to the received packet. This can lead to more problems in the new connection. These are solved making you wait for all traffic belonging to the old connection to die or reach destination, before you use that socket again.

sockets go to not closing after 32739 connections

UPDATE : After investigating lil more I found the real problem for this behavior . Problem is, I am creating the threads for each connection and passing the sock fd to the thread but was not pthraed_joining immediately so that made my main thread not to able to create any more threads after the connection acceptance. and my logic of closing the socket is in child thread, coz of that i was not able to close the socket and hence they were going to WAIT CLOSE state. SO I just detached the threads after creating them and all works well as of now !!
I have a client server program, I am using a script to run the client and make as many as connections possible and close them after sending a line of data and exit the client, every thing works fine until 32739 th connection i.e. connection is closed on both the sides and all but after that number the connection is not getting closed and server stops taking any more connections and if do
netstat -tonpa 2>&1 | grep CLOSE
I see around 1020 sockets waiting for CLOSE. sample out of the command,
tcp 25 0 192.168.0.175:16099 192.168.0.175:41704 CLOSE_WAIT 5250/./bl_manager off (0.00/0/0)
tcp 24 0 192.168.0.175:16099 192.168.0.175:41585 CLOSE_WAIT 5250/./bl_manager off (0.00/0/0)
tcp 30 0 192.168.0.175:16099 192.168.0.175:41679 CLOSE_WAIT 5250/./bl_manager off (0.00/0/0)
tcp 31 0 192.168.0.175:16099 192.168.0.175:41339 CLOSE_WAIT 5250/./bl_manager off (0.00/0/0)
tcp 25 0 192.168.0.175:16099 192.168.0.175:41760 CLOSE_WAIT 5250/./bl_manager off (0.00/0/0)
I am using following code to detect the client disconnection.
for(fd = 0; fd <= fd_max; fd++) {
if(FD_ISSET(fd, &testfds)) {
if (fd == client_fd) {
ioctl(fd, FIONREAD, &nread);
if(nread == 0) {
FD_CLR(fd, &readfds);
close(fd);
return 0;
}
}
}
} /* for()*/
Please do let me know if am doing anything wrong. Its a Python client and CPP server setup.
thank you
CLOSE-WAIT means the port is waiting for the local application to close the socket, having already received a close from the peer. Clearly you are leaking sockets somehow, possibly in an error path.
Your code to 'detect client disconnection' is completely incorrect. All you are testing is the amount of data that can be read without blocking, i.e. that has already arrived. The correct test is a return value of zero from recv() or an error other than EAGAIN/EWOULDBLOCK when reading or writing.
Without knowing your platform, I can't be sure, but the fact that you're clearly using select, and you're having a problem only a few dozen away from 32768, it seems very likely that this is your problem.
An fd_set is a collection of bits, indexed by file descriptor numbers. Every platform has a different max number. OpenBSD and recent versions of FreeBSD and OS X usually limit fd_set to an FD_SETSIZE that defaults to 1024. Different linux boxes seem to have 1024, 4096, 32768, and 65536.
So, what happens if you FD_ISSET(32800, &testfds) and FD_SETSIZE is 32768? You're asking it to read a bit from arbitrary memory.
A select or other call before this should give you an EINVAL error when you pass in 32800 for the nfds parameter… but historically, many platforms have not done so. Or they have returned an error, but only after filling in the first FD_SETSIZE bits properly and leaving the rest set to uninitialized memory, which means if you forget to check the error, your code seems to work until you stress it.
This is one of the reasons using select for more than a few hundred sockets is a bad idea. The other reason is that select is linear (and, worse, not linear on the number of current sockets, but linear on the highest fd, so even after most clients go away it's still slow).
Most modern platforms that have select also have poll, which avoids that problem.
Unless you're on Windows… in which case there are completely different reasons not to use select, and different answers.

How to get return value of 0 in UDP recvfrom() call after shutdown call.?

I have tried getting the value of 0 over the network when the socket connection has been gracefully closed by the sender as specified here. When I used unblocked call I was getting -1 in the UDP stream before data was sent from sender to the receiver . After the original data was sent and when I closed the connection(tried shutting down the socket and closing the socket on the sender side) I was still getting -1 rather than getting 0 indicating the socket has been closed. can anybody please help is there is any way to get the same.
Thanks.
When UDP socket is close(2)-ed there's nothing sent out, even if the socket was connect(2)-ed. TCP, on the other hand, initiates four-way connection tear-down. Looks like you are confusing these two cases.
UNIX man page for shutdown states the following:
Return Value:
On success, zero is returned. On error, -1 is returned,
and errno is set appropriately.
Errors:
EBADF - sockfd is not a valid descriptor.
ENOTCONN - The specified socket is not connected.
ENOTSOCK - sockfd is a file, not a socket.
And Windows platform have quite the same:
Return value
If no error occurs, shutdown returns zero. Otherwise, a value of
SOCKET_ERROR is returned, and a specific error code can be retrieved
by calling WSAGetLastError.
Thing is: UDP is not connection oriented protocol and connect() call for it do not mean that any association is established whatsoever.
So my guess, you're actually getting ENOTCONN error (or WSAENOTCONN, if you're on Windows)
Check your errno (or WSAGetLastError() on Windows)

Why would connect() give EADDRNOTAVAIL?

I have in my application a failure that arose which does not seem to be reproducible. I have a TCP socket connection which failed and the application tried to reconnect it. In the second call to connect() attempting to reconnect, I got an error result with errno == EADDRNOTAVAIL which the man page for connect() says means: "The specified address is not available from the local machine."
Looking at the call to connect(), the second argument appears to be the address to which the error is referring to, but as I understand it, this argument is the TCP socket address of the remote host, so I am confused about the man page referring to the local machine. Is it that this address to the remote TCP socket host is not available from my local machine? If so, why would this be? It had to have succeeded calling connect() the first time before the connection failed and it attempted to reconnect and got this error. The arguments to connect() were the same both times.
Would this error be a transient one which, if I had tried calling connect again might have gone away if I waited long enough? If not, how should I try to recover from this failure?
Check this link
http://www.toptip.ca/2010/02/linux-eaddrnotavail-address-not.html
EDIT: Yes I meant to add more but had to cut it there because of an emergency
Did you close the socket before attempting to reconnect? Closing will tell the system that the socketpair (ip/port) is now free.
Here are additional items too look at:
If the local port is already connected to the given remote IP and port (i.e., there's already an identical socketpair), you'll receive this error (see bug link below).
Binding a socket address which isn't the local one will produce this error. if the IP addresses of a machine are 127.0.0.1 and 1.2.3.4, and you're trying to bind to 1.2.3.5 you are going to get this error.
EADDRNOTAVAIL: The specified address is unavailable on the remote machine or the address field of the name structure is all zeroes.
Link with a bug similar to yours (answer is close to the bottom)
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4294599
It seems that your socket is basically stuck in one of the TCP internal states and that adding a delay for reconnection might solve your problem as they seem to have done in that bug report.
This can also happen if an invalid port is given, like 0.
If you are unwilling to change the number of temporary ports available (as suggested by David), or you need more connections than the theoretical maximum, there are two other methods to reduce the number of ports in use. However, they are to various degrees violations of the TCP standard, so they should be used with care.
The first is to turn on SO_LINGER with a zero-second timeout, forcing the TCP stack to send a RST packet and flush the connection state. There is one subtlety, however: you should call shutdown on the socket file descriptor before you close, so that you have a chance to send a FIN packet before the RST packet. So the code will look something like:
shutdown(fd, SHUT_RDWR);
struct linger linger;
linger.l_onoff = 1;
linger.l_linger = 0;
// todo: test for error
setsockopt(fd, SOL_SOCKET, SO_LINGER,
(char *) &linger, sizeof(linger));
close(fd);
The server should only see a premature connection reset if the FIN packet gets reordered with the RST packet.
See TCP option SO_LINGER (zero) - when it's required for more details. (Experimentally, it doesn't seem to matter where you set setsockopt.)
The second is to use SO_REUSEADDR and an explicit bind (even if you're the client), which will allow Linux to reuse temporary ports when you run, before they are done waiting. Note that you must use bind with INADDR_ANY and port 0, otherwise SO_REUSEADDR is not respected. Your code will look something like:
int opts = 1;
// todo: test for error
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
(char *) &opts, sizeof(int));
struct sockaddr_in listen_addr;
listen_addr.sin_family = AF_INET;
listen_addr.sin_port = 0;
listen_addr.sin_addr.s_addr = INADDR_ANY;
// todo: test for error
bind(fd, (struct sockaddr *) &listen_addr, sizeof(listen_addr));
// todo: test for addr
// saddr is the struct sockaddr_in you're connecting to
connect(fd, (struct sockaddr *) &saddr, sizeof(saddr));
This option is less good because you'll still saturate the internal kernel data structures for TCP connections as per netstat -an | grep -e tcp -e udp | wc -l. However, you won't start reusing ports until this happens.
I got this issue. I got it resolve by enabling tcp timestamp.
Root cause:
After connection close, Connections will go in TIME_WAIT state for some
time.
During this state if any new connections comes with same IP and PORT,
if SO_REUSEADDR is not provided during socket creation then socket bind()
will fail with error EADDRINUSE.
But even though after providing SO_REUSEADDR also sockect connect() may
fail with error EADDRNOTAVAIL if tcp timestamp is not enable on both side.
Solution:
Please enable tcp timestamp on both side client and server.
echo 1 > /proc/sys/net/ipv4/tcp_timestamps
Reason to enable tcp_timestamp:
When we enable tcp_tw_reuse, sockets in TIME_WAIT state can be used before they expire, and the kernel will try to make sure that there is no collision regarding TCP sequence numbers. If we enable tcp_timestamps, it will make sure that those collisions cannot happen. However, we need TCP timestamps to be enabled on both ends. See the definition of tcp_twsk_unique for the gory details.
reference:
https://serverfault.com/questions/342741/what-are-the-ramifications-of-setting-tcp-tw-recycle-reuse-to-1
Another thing to check is that the interface is up. I got confused by this one recently while using network namespaces, since it seems creating a new network namespace produces an entirely independent loopback interface but doesn't bring it up (at least, with Debian wheezy's versions of things). This escaped me for a while since one doesn't typically think of loopback as ever being down.

Socket in use error when reusing sockets

I am writing an XMLRPC client in c++ that is intended to talk to a python XMLRPC server.
Unfortunately, at this time, the python XMLRPC server is only capable of fielding one request on a connection, then it shuts down, I discovered this thanks to mhawke's response to my previous query about a related subject
Because of this, I have to create a new socket connection to my python server every time I want to make an XMLRPC request. This means the creation and deletion of a lot of sockets. Everything works fine, until I approach ~4000 requests. At this point I get socket error 10048, Socket in use.
I've tried sleeping the thread to let winsock fix its file descriptors, a trick that worked when a python client of mine had an identical issue, to no avail.
I've tried the following
int err = setsockopt(s_,SOL_SOCKET,SO_REUSEADDR,(char*)TRUE,sizeof(BOOL));
with no success.
I'm using winsock 2.0, so WSADATA::iMaxSockets shouldn't come into play, and either way, I checked and its set to 0 (I assume that means infinity)
4000 requests doesn't seem like an outlandish number of requests to make during the run of an application. Is there some way to use SO_KEEPALIVE on the client side while the server continually closes and reopens?
Am I totally missing something?
The problem is being caused by sockets hanging around in the TIME_WAIT state which is entered once you close the client's socket. By default the socket will remain in this state for 4 minutes before it is available for reuse. Your client (possibly helped by other processes) is consuming them all within a 4 minute period. See this answer for a good explanation and a possible non-code solution.
Windows dynamically allocates port numbers in the range 1024-5000 (3977 ports) when you do not explicitly bind the socket address. This Python code demonstrates the problem:
import socket
sockets = []
while True:
s = socket.socket()
s.connect(('some_host', 80))
sockets.append(s.getsockname())
s.close()
print len(sockets)
sockets.sort()
print "Lowest port: ", sockets[0][1], " Highest port: ", sockets[-1][1]
# on Windows you should see something like this...
3960
Lowest port: 1025 Highest port: 5000
If you try to run this immeditaely again, it should fail very quickly since all dynamic ports are in the TIME_WAIT state.
There are a few ways around this:
Manage your own port assignments and
use bind() to explicitly bind your
client socket to a specific port
that you increment each time your
create a socket. You'll still have
to handle the case where a port is
already in use, but you will not be
limited to dynamic ports. e.g.
port = 5000
while True:
s = socket.socket()
s.bind(('your_host', port))
s.connect(('some_host', 80))
s.close()
port += 1
Fiddle with the SO_LINGER socket
option. I have found that this
sometimes works in Windows (although
not exactly sure why):
s.setsockopt(socket.SOL_SOCKET,
socket.SO_LINGER, 1)
I don't know if this will help in
your particular application,
however, it is possible to send
multiple XMLRPC requests over the
same connection using the
multicall method. Basically
this allows you to accumulate
several requests and then send them
all at once. You will not get any
responses until you actually send
the accumulated requests, so you can
essentially think of this as batch
processing - does this fit in with
your application design?
Update:
I tossed this into the code and it seems to be working now.
if(::connect(s_, (sockaddr *) &addr, sizeof(sockaddr)))
{
int err = WSAGetLastError();
if(err == 10048) //if socket in user error, force kill and reopen socket
{
closesocket(s_);
WSACleanup();
WSADATA info;
WSAStartup(MAKEWORD(2,0), &info);
s_ = socket(AF_INET,SOCK_STREAM,0);
setsockopt(s_,SOL_SOCKET,SO_REUSEADDR,(char*)&x,sizeof(BOOL));
}
}
Basically, if you encounter the 10048 error (socket in use), you can simply close the socket, call cleanup, and restart WSA, the reset the socket and its sockopt
(the last sockopt may not be necessary)
i must have been missing the WSACleanup/WSAStartup calls before, because closesocket() and socket() were definitely being called
this error only occurs once every 4000ish calls.
I am curious as to why this may be, even though this seems to fix it.
If anyone has any input on the subject i would be very curious to hear it
Do you close the sockets after using it?