Linux CentOS 5: non-blocking socket send hangs indefinitely - c++

I have the following C++ code on Linux:
if (epoll_wait(hEvent,&netEvents,1,0))
{
    // check FIRST for disconnection to avoid send() to a closed socket (halts on CentOS on my server!)
    if ((netEvents.events & EPOLLERR) || (netEvents.events & EPOLLHUP) || (netEvents.events & EPOLLRDHUP)) {
        save_log("> client terminated connection");
        goto connection_ended; // ---[ if it's a CLOSE event .. close :)
    }
    if (netEvents.events & EPOLLOUT) // ---[ if socket is available for write
    {
        if (send_len) {
            result = send(s, buffer, send_len, MSG_NOSIGNAL);
            save_slogf("1112:send (s=%d,len=%d,ret=%d,errno=%d,epoll=%d,events=%d)", s, send_len, result, errno, hEvent, netEvents.events);
            if (result > 0) {
                send_len = 0;
                current_stage = CL_STAGE_USE_LINK_BRIDGE;
                if (close_after_send_response) {
                    save_log("> destination machine closed connection");
                    close_after_send_response = false;
                    goto connection_ended;
                }
            } else {
                if (errno == EAGAIN) return;
                else if (errno == EWOULDBLOCK) return;
                else {
                    save_log("> unexpected error on socket, terminating");
connection_ended:
                    close_client();
                    reset();
                    return;
                }
            }
        }
    }
}
hEvent: epoll instance created to listen for EPOLLIN, EPOLLOUT, EPOLLERR, EPOLLHUP, EPOLLRDHUP
s: NON-BLOCKING (!!!) socket created from an accept() on a non-blocking listening socket
Basically this code is attempting to send a packet back to a user that connected to the server. It usually works fine, but on RANDOM occasions (perhaps when some weird network event happens) the program hangs indefinitely on the "result = send(s,buffer,send_len,MSG_NOSIGNAL)" line.
I have no idea what the cause may be. I have tried to monitor the socket operations and nothing gave me a hint of a clue as to why it happens. I have to assume this is either a KERNEL bug or something very weird, because I have the same program written under Windows and it works perfectly there.
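One thing worth ruling out (my assumption, not something shown in the question): on Linux, a socket returned by accept() does not inherit file status flags such as O_NONBLOCK from the listening socket, so the accepted fd can end up blocking even though the listener is non-blocking, and a blocking send() would hang exactly like this when the send buffer is full. A minimal sketch of forcing the flag explicitly (make_nonblocking is an illustrative helper, not part of the code above):

#include <fcntl.h>
#include <cstdio>

// Ensure an accepted socket really is non-blocking. On Linux, accept()
// does NOT copy file status flags such as O_NONBLOCK from the listener.
static bool make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);              // read current status flags
    if (flags == -1) { perror("fcntl(F_GETFL)"); return false; }
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        perror("fcntl(F_SETFL)");
        return false;
    }
    return true;
}

// usage, right after accept():
// int s = accept(listen_fd, nullptr, nullptr);
// if (s >= 0 && !make_nonblocking(s)) { /* close s and bail out */ }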

Related

Close connection with client after inactivity period

I'm currently managing a server that can serve at most MAX_CLIENTS clients concurrently.
This is the code I've written so far:
//create and bind listen_socket_
struct pollfd poll_fds_[MAX_CLIENTS];
for (auto& poll_fd : poll_fds_)
{
    poll_fd.fd = -1;
}

listen(listen_socket_, MAX_CLIENTS);
poll_fds_[0].fd = listen_socket_;
poll_fds_[0].events = POLLIN;

while (enabled)
{
    const int result = poll(poll_fds_, MAX_CLIENTS, DEFAULT_TIMEOUT);
    if (result == 0)
    {
        continue;
    }
    else if (result < 0)
    {
        // throw error
    }
    else
    {
        for (auto& poll_fd : poll_fds_)
        {
            if (poll_fd.revents == 0)
            {
                continue;
            }
            else if (poll_fd.revents != POLLIN)
            {
                // throw error
            }
            else if (poll_fd.fd == listen_socket_)
            {
                int new_socket = accept(listen_socket_, nullptr, nullptr);
                if (new_socket < 0)
                {
                    // throw error
                }
                else
                {
                    for (auto& free_slot : poll_fds_)
                    {
                        if (free_slot.fd == -1)
                        {
                            free_slot.fd = new_socket;
                            free_slot.events = POLLIN;
                            break;
                        }
                    }
                }
            }
            else
            {
                // serve connection
            }
        }
    }
}
Everything is working great, and when a client closes the socket on its side, everything gets handled well.
The problem I'm facing is that when a client connects and sends a request but does not close the socket on its side afterwards, I do not detect it and that socket stays "busy".
Is there any way to implement a system to detect if nothing is received on a socket after a certain time? In that way I could free that connection on the server side, leaving room for new clients.
Thanks in advance.
You could close the client connection when the client has not sent any data for a specific time.
For each client, you need to store the time when the last data was received.
Periodically, for example when poll() returns because the timeout expired, you need to check this time for all clients. When this time is too long ago, you can shutdown(SHUT_WR) and close() the connection. You need to determine what "too long ago" means.
If a client does not have any data to send but wants to leave the connection open, it could send a "ping" message periodically. The server could reply with a "pong" message. These are just small messages with no actual data. It depends on your client/server protocol whether you can implement this.
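A rough sketch of the bookkeeping described above; MAX_CLIENTS and poll_fds_ come from the question, while last_activity_, IDLE_TIMEOUT_SEC and reap_idle_clients are names invented here for illustration:

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <ctime>
#include <cstddef>

// One timestamp per slot in poll_fds_, refreshed whenever recv() on that
// slot returns data.
time_t last_activity_[MAX_CLIENTS] = {};
const time_t IDLE_TIMEOUT_SEC = 60;   // pick a value that fits your protocol

void reap_idle_clients(pollfd* poll_fds, std::size_t count)
{
    const time_t now = time(nullptr);
    for (std::size_t i = 1; i < count; ++i)          // slot 0 is the listening socket
    {
        if (poll_fds[i].fd != -1 && now - last_activity_[i] > IDLE_TIMEOUT_SEC)
        {
            shutdown(poll_fds[i].fd, SHUT_WR);       // tell the client we are done sending
            close(poll_fds[i].fd);
            poll_fds[i].fd = -1;                     // free the slot for new clients
        }
    }
}

It could be called from the poll() loop whenever poll() returns 0 (timeout), with last_activity_[i] updated every time recv() on slot i returns data.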

Socket send() hangs when in CLOSE_WAIT state

I have a C++ server application, written using the POCO framework. The server application is acting as an HTTP server in this case. There is a client application, which I don't control and cannot debug, that is causing a problem in the server. The client requests a large file, which is returned as the HTTP response. During the return of the file the client closes the connection. I see the socket move to the CLOSE_WAIT state, indicating that the client has sent a FIN. The trouble is that in my application the send() function then hangs, causing one of my HTTP threads to be basically lost, and once all the threads enter this state the server is unresponsive.
The send code is inside the POCO framework, but looks like this:
do
{
    if (_sockfd == POCO_INVALID_SOCKET) throw InvalidSocketException();
    rc = ::send(_sockfd, reinterpret_cast<const char*>(buffer), length, flags);
}
while (_blocking && rc < 0 && lastError() == POCO_EINTR);
if (rc < 0) error();
return rc;
(flags are 0 in calls to this function). I tried to detect this state by adding the following code:
char c;
int r;
int rc;
do
{
    // Check if FIN received
    while ((r = recv(_sockfd, &c, 1, MSG_DONTWAIT)) == 1) {}
    if (r == 0) { ::close(_sockfd); _sockfd = POCO_INVALID_SOCKET; } // FIN received
    if (_sockfd == POCO_INVALID_SOCKET) throw InvalidSocketException();
    rc = ::send(_sockfd, reinterpret_cast<const char*>(buffer), length, flags);
}
while (_blocking && rc < 0 && lastError() == POCO_EINTR);
if (rc < 0) error();
return rc;
This appears to make things better, but it still does not solve the problem. I end up with the server not hanging as quickly, but with many more CLOSE_WAIT sockets, so I think I have partially solved the thread-hanging issue but still have not tidied up correctly after the broken socket. With this change in place the problem happens less often, but it still happens, so I think the key is understanding why send() hangs.
I'm testing this code on Linux.
To cleanly close a socket:
Call shutdown with SHUT_WR (SD_SEND on Windows).
Keep reading from the socket until read returns zero or a fatal error.
Close the socket.
Do not attempt to access the socket after you've closed it.
Your code has two major issues. It doesn't ensure that close is always called on the socket no matter what happens, and it can access the socket after it has closed it. The former is causing your CLOSE_WAIT problem. The latter is a huge security hole.
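A minimal sketch of that close sequence on POSIX (assuming a blocking socket; graceful_close is an illustrative helper, not part of POCO):

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>

// Gracefully close `fd`: stop sending, drain whatever the peer still has in
// flight, then close. After close() the descriptor must not be touched again.
void graceful_close(int fd)
{
    shutdown(fd, SHUT_WR);                 // send our FIN (SD_SEND on Windows)

    char discard[4096];
    for (;;)
    {
        ssize_t n = recv(fd, discard, sizeof discard, 0);
        if (n > 0)  continue;              // peer still sending, keep draining
        if (n == 0) break;                 // peer's FIN received, clean shutdown
        if (errno == EINTR) continue;      // interrupted, retry
        break;                             // fatal error, give up and close anyway
    }
    close(fd);
}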

How to catch a "connection reset by peer" error in C socket?

I have a C++ and Qt application, part of which implements a C socket client. Some time ago my app crashed because something happened with the server; the only thing I got from that crash was a message in Qt Creator's Application Output stating
recv_from_client: Connection reset by peer
I did some research on the web about this "connection reset by peer" error, and while some threads here on SO and in other places did manage to explain what is going on, none of them tells me how to handle it - that is, how can I "catch" the error and continue my application without a crash? (In particular, the method where I read from the server is inside a while loop, so I'd like to stop the while loop and enter another part of my code that will try to re-establish the connection.)
So how can I catch this error to handle it appropriately? Don't forget that my code is actually C++ with Qt - the C part is a library which calls the socket methods.
EDIT
Btw, the probable method from which the crash originated (given the "recv_from_client" part of the error message above) was:
int hal_socket_read_from_client(socket_t *obj, u_int8_t *buffer, int size)
{
    struct s_socket_private * const socket_obj = (struct s_socket_private *)obj;
    int retval = recv(socket_obj->client_fd, buffer, size, MSG_DONTWAIT); // last = 0
    if (retval < 0)
        perror("recv_from_client");
    return retval;
}
Note: I'm not sure if by the time this error occurred, the recv configuration was with MSG_DONTWAIT or with 0.
Just examine errno when read() returns a negative result.
There is normally no crash involved.
while (...) {
    ssize_t amt = read(sock, buf, size);
    if (amt > 0) {
        // success
    } else if (amt == 0) {
        // remote shutdown (EOF)
    } else {
        // error

        // Interrupted by signal, try again
        if (errno == EINTR)
            continue;

        // This is fatal... you have to close the socket and reconnect
        // handle errno == ECONNRESET here

        // If you use non-blocking sockets, you also have to handle
        // EWOULDBLOCK / EAGAIN here
        return;
    }
}
It isn't an exception or a signal. You can't catch it. Instead, you get an error which tells you that the connection was reset when you try to work on that socket.
int rc = recv(fd, ..., ..., ...);
if (rc == -1)
{
    if (errno == ECONNRESET)
    {
        /* handle it; there isn't much to do, though. */
    }
    else
    {
        perror("Error while reading");
    }
}
As I wrote, there isn't much you can do. If you're using an I/O multiplexer, you may want to remove that file descriptor from further monitoring.
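Tying the two answers together, here is a sketch of a recv() wrapper that maps ECONNRESET and EOF to a "connection lost" result the caller can use to break out of its read loop and reconnect. The enum and function names are invented for illustration; also, as an assumption (the question shows no backtrace), an outright process crash is more often caused by SIGPIPE on a later write, which can be suppressed with signal(SIGPIPE, SIG_IGN) or MSG_NOSIGNAL:

#include <sys/types.h>
#include <sys/socket.h>
#include <cerrno>
#include <csignal>
#include <cstddef>

enum class ReadStatus { Ok, WouldBlock, ConnectionLost, Fatal };

// Hypothetical wrapper around recv(); out_len receives the byte count on Ok.
ReadStatus read_from_client(int fd, void* buf, std::size_t size, ssize_t& out_len)
{
    out_len = recv(fd, buf, size, MSG_DONTWAIT);
    if (out_len > 0)  return ReadStatus::Ok;
    if (out_len == 0) return ReadStatus::ConnectionLost;        // peer closed (EOF)
    if (errno == EAGAIN || errno == EWOULDBLOCK) return ReadStatus::WouldBlock;
    if (errno == EINTR)      return ReadStatus::WouldBlock;     // interrupted, just retry later
    if (errno == ECONNRESET) return ReadStatus::ConnectionLost; // RST from peer
    return ReadStatus::Fatal;
}

// Once at startup, so a write to a reset connection returns EPIPE instead of
// killing the process:
// signal(SIGPIPE, SIG_IGN);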

Why does OpenSSL give me a "called a function you should not call" error?

I'm working on adding OpenSSL support to my server program, and generally it's working pretty well, but I have come across a problem.
First, some background: The server is single-threaded and uses non-blocking I/O and a select() loop to handle multiple clients simultaneously. The server is linked against libssl.0.9.8.dylib and libcrypto.0.9.8.dylib (i.e. the libraries provided in /usr/lib by Mac OS X 10.8.5). The client<->server protocol is a proprietary full-duplex messaging protocol; that is, the clients and the server are all allowed to send and receive data at any time, and the client<->server TCP connections remain connected indefinitely (i.e. until the client or server decides to disconnect).
The issue is this: my clients can connect to the server, and sending and receiving data works fine (now that I have the SSL_ERROR_WANT_WRITE and SSL_ERROR_WANT_READ logic sorted out)… but if the server accept()s a new client connection while other clients are in the middle of sending or receiving data, the SSL layer seems to break. In particular, immediately after the server runs the SetupSSL() routine (shown below) to set up the newly accepted socket, SSL_read() on one or more of the other (pre-existing) clients' sockets returns -1, and ERR_print_errors_fp(stderr) gives this output:
SSL_read() ERROR: 5673:error:140F3042:SSL routines:SSL_UNDEFINED_CONST_FUNCTION:called a function you should not call:/SourceCache/OpenSSL098/OpenSSL098-47.2/src/ssl/ssl_lib.c:2248:
After this error first appears, the server largely stops working. Data movement stops, and if I try to connect another client I often get this error:
SSL_read() ERROR: 5673:error:140760FC:SSL routines:SSL23_GET_CLIENT_HELLO:unknown protocol:/SourceCache/OpenSSL098/OpenSSL098-47.2/src/ssl/s23_srvr.c:578:
This happens about 25% of the time in my test scenario. If I make sure that my pre-existing client connections are idle (no data being sent or received) at the moment when the new client connects, it never happens. Does anyone know what might be going wrong here? Have I found an OpenSSL bug, or is there some detail that I'm overlooking? Some relevant code from my program is pasted below, in case it's helpful.
// Socket setup routine, called when the server accepts a new TCP socket
int SSLSession :: SetupSSL(int sockfd)
{
    _ctx = SSL_CTX_new(SSLv23_method());
    if (_ctx)
    {
        SSL_CTX_set_mode(_ctx, SSL_MODE_ENABLE_PARTIAL_WRITE);
        _ssl = SSL_new(_ctx);
        if (_ssl)
        {
            _sbio = BIO_new_socket(sockfd, BIO_NOCLOSE);
            if (_sbio)
            {
                SSL_set_bio(_ssl, _sbio, _sbio);
                SSL_set_accept_state(_ssl);
                BIO_set_nbio(_sbio, !blocking);
                ERR_print_errors_fp(stderr);
                return RESULT_SUCCESS;
            }
            else fprintf(stderr, "SSLSession: BIO_new_socket() failed!\n");
        }
        else fprintf(stderr, "SSLSession: SSL_new() failed!\n");
    }
    else fprintf(stderr, "SSLSession: SSL_CTX_new() failed!\n");
    return RESULT_FAILURE;
}
// Socket read routine -- returns number of bytes read from SSL-land
int32 SSLSession :: Read(void *buffer, uint32 size)
{
    if (_ssl == NULL) return -1;

    int32 bytes = SSL_read(_ssl, buffer, size);
    if (bytes > 0)
    {
        _sslState &= ~(SSL_STATE_READ_WANTS_READABLE_SOCKET | SSL_STATE_READ_WANTS_WRITEABLE_SOCKET);
    }
    else if (bytes == 0) return -1; // connection was terminated
    else
    {
        int err = SSL_get_error(_ssl, bytes);
        if (err == SSL_ERROR_WANT_WRITE)
        {
            // We have to wait until our socket is writeable, and then repeat our SSL_read() call.
            _sslState &= ~SSL_STATE_READ_WANTS_READABLE_SOCKET;
            _sslState |= SSL_STATE_READ_WANTS_WRITEABLE_SOCKET;
            bytes = 0;
        }
        else if (err == SSL_ERROR_WANT_READ)
        {
            // We have to wait until our socket is readable, and then repeat our SSL_read() call.
            _sslState |= SSL_STATE_READ_WANTS_READABLE_SOCKET;
            _sslState &= ~SSL_STATE_READ_WANTS_WRITEABLE_SOCKET;
            bytes = 0;
        }
        else
        {
            fprintf(stderr, "SSL_read() ERROR: ");
            ERR_print_errors_fp(stderr);
        }
    }
    return bytes;
}
// Socket write routine -- returns number of bytes written to SSL-land
int32 SSLSession :: Write(const void *buffer, uint32 size)
{
    if (_ssl == NULL) return -1;

    int32 bytes = SSL_write(_ssl, buffer, size);
    if (bytes > 0)
    {
        _sslState &= ~(SSL_STATE_WRITE_WANTS_READABLE_SOCKET | SSL_STATE_WRITE_WANTS_WRITEABLE_SOCKET);
    }
    else if (bytes == 0) return -1; // connection was terminated
    else
    {
        int err = SSL_get_error(_ssl, bytes);
        if (err == SSL_ERROR_WANT_READ)
        {
            // We have to wait until our socket is readable, and then repeat our SSL_write() call.
            _sslState |= SSL_STATE_WRITE_WANTS_READABLE_SOCKET;
            _sslState &= ~SSL_STATE_WRITE_WANTS_WRITEABLE_SOCKET;
            bytes = 0;
        }
        else if (err == SSL_ERROR_WANT_WRITE)
        {
            // We have to wait until our socket is writeable, and then repeat our SSL_write() call.
            _sslState &= ~SSL_STATE_WRITE_WANTS_READABLE_SOCKET;
            _sslState |= SSL_STATE_WRITE_WANTS_WRITEABLE_SOCKET;
            bytes = 0;
        }
        else
        {
            fprintf(stderr, "SSL_write() ERROR!");
            ERR_print_errors_fp(stderr);
        }
    }
    return bytes;
}
Someone on the openssl-users mailing list helped me figure this out; the problem was that I was setting up my SSL session with SSLv23_method(), and when using SSLv23_method(), you mustn't call SSL_pending() until after the SSL handshake has finished negotiating which protocol (SSLv2, SSLv3, TLSv1, etc) it's actually going to use.
Since my application doesn't require compatibility with older versions of SSL, the quick workaround for me is to call SSLv3_method() during setup instead of SSLv23_method(). If backwards compatibility were needed, I'd have to figure out some way of detecting when the protocol negotiation had completed and avoid calling SSL_pending() until then; but I'm going to ignore the issue for now since I don't need that functionality.
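For completeness, if backwards compatibility with SSLv23_method() were needed, one possible approach (a sketch based on the explanation above, not taken from the mailing-list thread) is to guard SSL_pending() until the handshake has finished:

#include <openssl/ssl.h>

// Only ask how many buffered bytes are pending once the handshake is done;
// with SSLv23_method() the record layer isn't decided until negotiation ends.
static int safe_pending(SSL* ssl)
{
    if (!SSL_is_init_finished(ssl))
        return 0;             // handshake still in progress, nothing safe to report
    return SSL_pending(ssl);
}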

Socket can't accept connections when non-blocking?

EDIT: Messed up my pseudo-coding of the accept call, it now reflects what I'm actually doing.
I've got two sockets going. I'm trying to use send/recv between the two. When the listening socket is blocking, it can see the connection and receive it. When it's nonblocking, I put a busy wait in (just to debug this) and it times out, always with the error EWOULDBLOCK. Why would the listening socket not be able to see a connection that it could see when blocking?
The code is mostly separated in functions, but here's some pseudo-code of what I'm doing.
int listener = -2;
int connector = -2;
int acceptedSocket = -2;

getaddrinfo(port 27015, AI_PASSIVE) results loop for listener socket
{
    if (listener socket() == 0)
    {
        if (listener bind() == 0)
            if (listener listen() == 0)
                break;
        listener close(); // if unsuccessful
    }
}

SetBlocking(listener, false);

getaddrinfo("localhost", port 27015) results loop for connector socket
{
    if (connector socket() == 0)
    {
        if (connector connect() == 0)
            break; // if connect successful
        connector close(); // if unsuccessful
    }
}

loop for 1 second
{
    acceptedSocket = listener accept();
    if (acceptedSocket > 0)
        break; // if successful
}
This just outputs a huge list of EWOULDBLOCK errno values before ultimately ending the timeout loop. If I output the file descriptor for the accepted socket in each loop iteration, it is never assigned a valid file descriptor.
The code for SetBlocking is as so:
int SetBlocking(int sockfd, bool blocking)
{
    int nonblock = !blocking;
    // FIONBIO takes a pointer to the int flag
    return ioctl(sockfd, FIONBIO, &nonblock);
}
If I use a blocking socket, either by calling SetBlocking(listener, true) or removing the SetBlocking() call altogether, the connection works no problem.
Also, note that this connection with the same implementation works in Windows, Linux, and Solaris.
Because of the tight loop you are not letting the OS complete your request. That's the difference between VxWorks and others - you basically preempt your kernel.
Use select(2) or poll(2) to wait for the connection instead.
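A sketch of that suggestion: wait for the listening socket to become readable with poll(2) instead of busy-looping on accept(). The helper name and the timeout handling are illustrative:

#include <poll.h>
#include <sys/socket.h>
#include <cerrno>

// Wait up to timeout_ms for the listener to become readable, then accept.
// Returns the accepted fd, or -1 on timeout or error.
int accept_with_timeout(int listener, int timeout_ms)
{
    pollfd pfd;
    pfd.fd = listener;
    pfd.events = POLLIN;           // "readable" on a listener means a pending connection

    int ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)                // 0 = timeout, -1 = error (check errno)
        return -1;

    return accept(listener, nullptr, nullptr);
}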