CLOSE_WAIT socket memory implications - c++

I'm working on a large server in C++ that provides service to multiple users. I noticed recently that in some code paths sockets are not closed properly, and we end up with thousands of them in the CLOSE_WAIT state. I read up on what this means:
CLOSE_WAIT indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed
So essentially this is the state where the socket is waiting for the application to execute close(). A socket can remain in CLOSE_WAIT indefinitely until the application closes it. Faulty scenarios would be things like a file descriptor leak, or the server never executing close() on the socket, leading to a pile-up of CLOSE_WAIT sockets.
I'm starting to hunt for the bug, because it's clear that we are not closing these sockets properly.
What does having all these sockets in CLOSE_WAIT imply for my program's memory? I noticed that when I have a lot of these sockets, the process starts to consume a lot of memory (4 GB). What memory implications does this have, or is this memory leak perhaps unrelated? Thank you very much.

Related

Socket recv() function

In C++ on Windows, if the connection behind a TCP socket is closed somehow, will a recv() call on that socket return immediately, or will it hang?
What would the result be (immediate return or hang) for blocking and for non-blocking sockets?
I am using Winsock version 2.
Thanks in advance.
As documented, depending on how the connection was closed, recv() should return more or less immediately: in the non-graceful case it returns SOCKET_ERROR and sets WSAGetLastError to one of the possible reasons, such as WSAENOTCONN, and in the graceful close scenario it just returns 0. This is the same for blocking and non-blocking sockets.
If the socket is connection oriented and the remote side has shut down the connection gracefully, and all data has been received, a recv will complete immediately with zero bytes received. If the connection has been reset, a recv will fail with the error WSAECONNRESET.
However, since I know the Windows API does not always work as documented, I recommend testing it.
If your recv() is based on BSD - as almost all are - it will return immediately if the connection is closed (per the local side's status), regardless of whether the socket is blocking or non-blocking.
recv() will return 0 upon a graceful disconnect, i.e. the peer shut down its end of the connection and its socket stack sent a FIN packet to your socket stack. You are pretty much guaranteed to get this result immediately, whether you use a blocking or non-blocking socket.
recv() will return -1 on any other error, including an abnormal connection loss.
You need to use WSAGetLastError() to find out what actually happened. For a lost connection on a blocking socket, you will usually get an error code such as WSAECONNRESET or WSAECONNABORTED. For a lost connection on a non-blocking socket, it is possible that recv() may report a WSAEWOULDBLOCK error immediately and then report the actual error at some later time, maybe via select() with an exception fd_set, or an asynchronous notification, depending on how you implement your non-blocking logic.
However, either way, you are NOT guaranteed to get a failure result on a lost connection in any timely manner! It MAY take some time (seconds, minutes, can even be hours in rare cases) before the OS deems the connection is actually lost and invalidates the socket connection. TCP is designed to recover lost connections when possible, so it has to account for temporary network outages and such, so there are internal timeouts. You don't see that in your code, it happens in the background.
If you don't want to wait for the OS to time out internally, you can always use your own timeout in your code, such as via select(), setsockopt(SO_RCVTIMEO), TCP keep-alives (setsockopt(SO_KEEPALIVE) or WSAIoctl(SIO_KEEPALIVE_VALS)), etc. You may still not get a failure immediately, but you will get it sooner rather than later.

What common programming mistakes can cause stuck CLOSE_WAIT in epoll edge triggered mode?

I'm wondering what common programming situations/bugs might cause a server process of mine to enter CLOSE_WAIT but never actually close the socket.
What I want to do is trigger this situation deliberately so that I can fix it. In a normal development environment I've not been able to trigger it, but the same code running on a live server occasionally produces them, so that after many days we have hundreds of them.
Googling for CLOSE_WAIT, it actually seems to be a very common problem, even in mature and supposedly well-written services like nginx.
CLOSE_WAIT is basically the state where the remote end has shut down the socket but the local application has not yet invoked close() on it. This usually happens when you are not expecting to read data from the socket and thus aren't watching it for readability.
Many applications for convenience sake will always monitor a socket for readability to detect a close.
A scenario to try out is this:
1. The peer sends 2k of data and immediately closes the connection.
2. Your socket is registered with epoll and gets a notification for readability.
3. Your application reads only 1k of data.
4. You stop monitoring the socket for readability.
(I'm not sure if edge-triggered epoll will end up delivering the shutdown event as a separate event).
See also:
(from man epoll_ctl)
EPOLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing half of connection. (This flag is especially useful for writing simple code to detect peer shutdown when using Edge Triggered monitoring.)

Handling POSIX socket read() errors

Currently I am implementing a simple client-server program with just the basic functionalities of read/write.
However, I noticed that if, for example, my server calls write() to reply to my client and the client does not have a corresponding read() call, my server program will just hang there.
Currently I am thinking of using a simple timer to define a timeout count, and then to disconnect the client after a certain count, but I am wondering if there is a more elegant/or standard way of handling such errors?
There are two general approaches to prevent server blocking and to handle multiple clients by a single server instance:
use POSIX threads to handle each client's connection. If one thread blocks because of an erroneous client, the other threads will still continue to run. If the remote client has just disappeared (crashed, network down, etc.), then sooner or later the TCP stack will signal a timeout and the blocked write operation will fail with an error.
use non-blocking I/O together with a polling mechanism, e.g. select(2) or poll(2). It is harder to program using polling calls, though. Network sockets are made non-blocking using fcntl(2), and in cases where a normal write(2) or read(2) on the socket would block, an EAGAIN error is returned instead. You can use select(2) or poll(2) to wait for something to happen on the socket with an adjustable timeout period. For example, waiting for the socket to become writable means that you will be notified when there is enough socket send buffer space, e.g. previously written data was flushed to the client machine's TCP stack.
If the client side isn't going to read from the socket anymore, it should close the socket with close(). And if you don't want to do that because the client might still want to write to the socket, then you should at least close the read half with shutdown(fd, SHUT_RD).
This will set it up so the server gets an EPIPE on the write call.
If you don't control the clients (i.e. random clients you didn't write can connect), the server should handle clients that actively attempt to be malicious. One way for a client to be malicious is to try to force your server to hang. You should use a combination of non-blocking sockets and the timeout mechanism you describe to keep this from happening.
In general, you should design the protocol for how the server and client communicate so that neither side is trying to write to the socket when the other side isn't going to be reading. This doesn't mean you have to synchronize them tightly or anything. But, for example, HTTP is defined in such a way that it's quite clear to either side whether or not the other side really expects them to write anything at any given point in the protocol.

c++ detect if the client closed the connection (multi sockets)

In my program I have several sockets on the server, each socket on its own port. I tried to detect whether the client closed the connection with:
signal(SIGPIPE, sig_pipe);
But I have the problem that I don't know on which socket the connection was closed.
Is there some method to get it to know?
More about the code:
In the main program I start 3 sockets on different ports. Accept, receive, and send for each socket run in their own thread, so I have 3 threads in total.
Thank you.
You should set up SIGPIPE to be ignored (see sigaction(2)) and handle the EPIPE error code from write(2) and the like.
Note that reading zero bytes from a TCP socket is the real indication of the other side closing the connection.

listening socket dies unexpectedly

I'm having a problem where a TCP socket listening on a port has been working perfectly for a very long time: it has handled multiple connections and seems to work flawlessly. However, occasionally the accept() call fails when creating a new connection, and I get the following error string from the system:
10022: An invalid argument was supplied.
Apparently this can happen when you call accept() on a socket that is no longer listening, but I have not closed the socket myself and have not been notified of any errors on that socket.
Can anyone think of any reasons why a listening socket would stop listening, or how the error mentioned above might be generated?
Some possibilities:
Some other part of your code overwrote the handle value. Check to see if it has changed (keep a copy somewhere else and compare, print it out, breakpoint on write in the debugger, whatever).
Something closed the handle.
Interactions with a buggy Winsock LSP.
One thing that comes to mind is system standby or hibernation mode. I'm not sure how these events are handled by the Winsock library. It might be that the network interface is (partially) shut down.
It might make sense to debug the socket's thread (either with an IDE or through a disassembler) and watch its execution for anything that might be causing it to stop listening.