Possible causes of a deadlock in socket select - c++

I have a jabber server application and another jabber client application, both in C++.
When the client receives and sends a lot of messages (more than 20 per second), it happens that select() just freezes and never returns.
According to netstat the socket is still connected on Linux, and tcpdump shows the messages still being sent to the client, but select() simply never returns.
Here is the code that calls select():
bool ConnectionTCPBase::dataAvailable( int timeout )
{
  if( m_socket < 0 )
    return true; // let recv() catch the closed fd

  fd_set fds;
  struct timeval tv;
  FD_ZERO( &fds );
  // the following causes a C4127 warning in VC++ Express 2008 and possibly other versions.
  // however, the reason for the warning can't be fixed in gloox.
  FD_SET( m_socket, &fds );
  tv.tv_sec = timeout / 1000000;
  tv.tv_usec = timeout % 1000000;

  return ( ( select( m_socket + 1, &fds, 0, 0, timeout == -1 ? 0 : &tv ) > 0 )
           && FD_ISSET( m_socket, &fds ) != 0 );
}
And here is the backtrace of the deadlock from gdb:
Thread 2 (Thread 0x7fe226ac2700 (LWP 10774)):
#0 0x00007fe224711ff3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000000004706a9 in gloox::ConnectionTCPBase::dataAvailable (this=0xcaeb60, timeout=<value optimized out>) at connectiontcpbase.cpp:103
#2 0x000000000046c4cb in gloox::ConnectionTCPClient::recv (this=0xcaeb60, timeout=10) at connectiontcpclient.cpp:131
#3 0x0000000000471476 in gloox::ConnectionTLS::recv (this=0xd1a950, timeout=648813712) at connectiontls.cpp:89
#4 0x00000000004324cc in glooxd::C2S::recv (this=0xc5d120, timeout=10) at c2s.cpp:124
#5 0x0000000000435ced in glooxd::C2S::run (this=0xc5d120) at c2s.cpp:75
#6 0x000000000042d789 in CNetwork::run (this=0xc56df0) at src/Network.cpp:343
#7 0x000000000043115f in threading::ThreadManager::threadWorker (data=0xc56e10) at src/ThreadManager.cpp:15
#8 0x00007fe2249bc9ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#9 0x00007fe22471970d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
Do you know what can cause select() to stop reporting incoming messages even though we are still sending them to the client?
Is there any buffer limit in Linux when receiving and sending a lot of messages through a socket?
Thanks

There are several possibilities.
Exceeding FD_SETSIZE
Your code checks for a negative file descriptor, but not for one exceeding the upper limit, which is FD_SETSIZE (typically 1024). Whenever that happens, your code is
corrupting its own stack
presenting an empty fd_set to select(), which will cause a hang
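A cheap defensive check, sketched here against the question's dataAvailable() (it is not part of gloox), would at least turn silent stack corruption into a visible failure:
if( m_socket >= FD_SETSIZE )
{
  // Too large for select(): bail out instead of letting FD_SET() write past the fd_set.
  // Whether to return true or false here is a policy decision; the point is to never
  // pass such a descriptor to FD_ZERO()/FD_SET().
  return false;
}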
Supposing that you do not need so many concurrently open file descriptors, the solution would probably consist in finding and removing a file descriptor leak, especially in the code up the stack that handles closing of abandoned descriptors.
There is a suspicious comment in your code that indicates a possible leak:
// let recv() catch the closed fd
If this comment means that somebody sets m_socket to -1 and hopes that a later recv() will catch the closed socket and close it, then, who knows, maybe we end up closing -1 and not the real closed socket. (Note the difference between closing on the network level and closing on the file descriptor level, which requires a separate close() call.)
This could also be treated by moving to poll but there are a few other limits imposed by the operating system that make this route quite challenging.
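For illustration only, here is a poll()-based sketch with the same semantics as dataAvailable() (timeout in microseconds, -1 meaning block forever); it is not the actual gloox code, but it is not limited by FD_SETSIZE:
#include <poll.h>

bool dataAvailablePoll( int fd, int timeout )
{
    if( fd < 0 )
        return true; // let recv() catch the closed fd, as in the original

    struct pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;

    // poll() takes milliseconds; the original timeout is in microseconds.
    int ms = ( timeout == -1 ) ? -1 : timeout / 1000;
    return poll( &p, 1, ms ) > 0 && ( p.revents & POLLIN ) != 0;
}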
Out of band data
You say that the server is "sending" data. If that means the data is sent using the send() call (as opposed to a write() call), use strace to determine the flags argument of send(). If the MSG_OOB flag is used, the data arrives as out-of-band data, and your select() call will not notice it unless you pass a copy of fds as another parameter:
fd_set fds_copy = fds;
select( m_socket + 1, &fds, 0, &fds_copy, timeout == -1 ? 0 : &tv )
Process starvation
If the box is heavily loaded, and the server is executing without any blocking calls and with real-time priority (use top to check on that) while the client is not, the client might be starved.
Suspended process
The client might theoretically have been stopped with a SIGSTOP. You would probably know if this is the case, for example if you pressed Ctrl-Z somewhere, or if some particular process other than you is exercising control over the client.

Related

epoll does not signal an event when the socket is closed

I have a listener socket; for every new connection I get, I add it to epoll like this:
int connfd = accept(listenfd, (struct sockaddr *)&clnt_addr, &clnt_addr_len);
ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT | EPOLLHUP;
ev.data.fd = connfd;
epoll_ctl(epollfd, EPOLL_CTL_ADD, connfd, &ev);
When new data is received, epoll signals an EPOLLIN event, as expected.
In that case I read all the data as follows:
long read = 0;
do {
    read = recv(events[n].data.fd, buffer, sizeof (buffer), 0);
} while (read > 0);
When the client disconnects, whether abruptly or gracefully, epoll does not signal an event.
This code runs in each thread; that's why I'm using EPOLLET.
So my question:
What do I need to do to get this event?
What do I need to do to close the socket so that there is no leakage of resources?
There are a few problems with your attempt.
You should not use EPOLLONESHOT unless you know what you are doing and you really need it. It disables the reporting of any further events for that descriptor until you rearm it with EPOLL_CTL_MOD.
You should not use EPOLLHUP to determine whether a connection was closed. The EPOLLHUP event may be raised before all data is read from the socket input stream, even if the client disconnects gracefully. I recommend only using EPOLLIN. If there is no input left (because of a forceful or graceful disconnect), recv() returns 0 (EOF) and you can close the socket.
Your recv() loop will block the reading thread and consume the whole stream until EOF (connection closed). The whole point of using epoll() is to not have to sit in a while ( recv(...) > 0 ) loop.
You should not use EPOLLET just because "the code runs multithreaded", but because you actually need it. You can write multithreaded code without edge-triggered mode. Using EPOLLET requires a thorough knowledge of the differences between blocking and non-blocking I/O; you can easily run into pitfalls (as mentioned in the manual), like edge-triggered starvation.
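To make that concrete, here is a minimal level-triggered sketch; register_connection() and handle_readable() are made-up helper names, and epollfd/connfd mirror the variables from the question:
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Registration: level-triggered, EPOLLIN only (no EPOLLET, no EPOLLONESHOT).
void register_connection( int epollfd, int connfd )
{
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = connfd;
    epoll_ctl( epollfd, EPOLL_CTL_ADD, connfd, &ev );
}

// Handling one EPOLLIN notification: one recv() per wakeup is enough in
// level-triggered mode; epoll reports the fd again if more data is pending.
void handle_readable( int fd )
{
    char buffer[4096];
    ssize_t nread = recv( fd, buffer, sizeof( buffer ), 0 );

    if( nread > 0 )
    {
        // process buffer[0 .. nread)
    }
    else
    {
        // 0 means the peer closed the connection (EOF); < 0 means an error such
        // as ECONNRESET. Either way, close() releases the descriptor and removes
        // it from the epoll interest list automatically.
        close( fd );
    }
}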

Can I write to a closed socket and forcefully correct the broken pipe error?

I have an application that runs on a large number of processors. On processor 0, I have a function that writes data to a socket if it is open. This function runs in a loop in a separate thread on processor 0, i.e. processor 0 is responsible for its own workload and has an extra thread running the communication on the socket.
//This function runs in a loop, called every 1.5 seconds
void T_main_loop(const int& client_socket_id, bool* exit_flag)
{
    //Check that the socket is still connected.
    int error_code;
    socklen_t error_code_size = sizeof(error_code);
    getsockopt(client_socket_id, SOL_SOCKET, SO_ERROR, &error_code, &error_code_size);

    if (error_code == 0)
    {
        //send some data
        int valsend = send(client_socket_id, data, size_of_data, 0);
    }
    else
    {
        *(exit_flag) = false; //This is used for some external logic.
        //Can I fix the broken pipe here somehow?
    }
}
When the client socket is closed, the program should just ignore the error, and this is standard behavior as far as I am aware.
However, I am using an external library (PETSc) that is somehow detecting the broken pipe error and closing the entire parallel (MPI) environment:
[0]PETSC ERROR: Caught signal number 13 Broken Pipe: Likely while reading or writing to a socket
I would like to leave the configuration of this library completely untouched if at all possible. Open to any robust workarounds that are possible.
By default, the OS sends the thread SIGPIPE if it tries to write into a (half) closed pipe or socket.
One option to disable the signal is to do signal(SIGPIPE, SIG_IGN);.
Another option is to use MSG_NOSIGNAL flag for send, e.g. send(..., MSG_NOSIGNAL);.
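A small sketch of both options; ignore_sigpipe() and send_data() are made-up names, and the parameters mirror the question's variables:
#include <cerrno>
#include <csignal>
#include <sys/socket.h>
#include <sys/types.h>

// Option 1: ignore SIGPIPE process-wide; call this once, e.g. early in main().
void ignore_sigpipe()
{
    std::signal( SIGPIPE, SIG_IGN );
}

// Option 2: suppress the signal per call with MSG_NOSIGNAL. A broken pipe then
// shows up as send() == -1 with errno == EPIPE instead of a process-killing signal.
bool send_data( int client_socket_id, const char* data, size_t size_of_data )
{
    ssize_t valsend = send( client_socket_id, data, size_of_data, MSG_NOSIGNAL );
    return !( valsend == -1 && errno == EPIPE );
}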

Recv() call hangs after remote host terminates

My problem is that I have a thread that is in a recv() call. The remote host suddenly terminates (without a close() socket call) and the recv() call continues to block. This is obviously not good because when I am joining the threads to close the process (locally) this thread will never exit because it is waiting on a recv that will never come.
So my question is what method do people generally consider to be the best way to deal with this issue? There are some additional things of note that should be known before answering:
There is no way for me to ensure that the remote host closes the socket prior to exit.
This solution cannot use external libraries (such as boost). It must use standard libraries/features of C++/C (preferably not C++0x specific).
I know this has likely been asked in the past, but I'd like to get someone's take on how to correct this issue properly (without doing something super hacky, which is what I would have done in the past).
Thanks!
Assuming you want to continue to use blocking sockets, you can use the SO_RCVTIMEO socket option:
SO_RCVTIMEO and SO_SNDTIMEO
Specify the receiving or sending timeouts until reporting an error. The parameter is a struct timeval. If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred; if no data has been transferred and the timeout has been reached then -1 is returned with errno set to EAGAIN or EWOULDBLOCK just as if the socket was specified to be nonblocking. If the timeout is set to zero (the default) then the operation will never timeout.
So, before you begin receiving:
struct timeval timeout = { timo_sec, timo_usec };
int r = setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));
assert(r == 0); /* or something more user friendly */
If you are willing to use non-blocking I/O, then you can use poll(), select(), epoll(), kqueue(), or whatever the appropriate event dispatching mechanism is for your system. The reason you need to use non-blocking I/O is that you need to allow the system call to recv() to return to notify you that there is no data in the socket's input queue. The example to use is a little bit more involved:
for (;;) {
    ssize_t bytes = recv(s, buf, sizeof(buf), MSG_DONTWAIT);
    if (bytes > 0) { /* ... */ continue; }
    if (bytes < 0) {
        if (errno == EWOULDBLOCK) {
            struct pollfd p = { s, POLLIN, 0 };
            int r = poll(&p, 1, timo_msec);
            if (r == 1) continue;
            if (r == 0) {
                /* ...handle timeout */
                /* either continue or break, depending on policy */
            }
        }
        /* ...handle errors */
        break;
    }
    /* connection is closed */
    break;
}
You can use TCP keep-alive probes to detect if the remote host is still reachable. When keep-alive is enabled, the OS will send probes if the connection has been idle for too long; if the remote host doesn't respond to the probes, then the connection is closed.
On Linux, you can enable keep-alive probes by setting the SO_KEEPALIVE socket option, and you can configure the parameters of the keep-alive with the TCP_KEEPCNT, TCP_KEEPIDLE, and TCP_KEEPINTVL socket options. See tcp(7) and socket(7) for more info on those.
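For example, here is a sketch that enables and tunes keep-alive on a connected Linux socket; the helper name and the chosen values are purely illustrative:
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int enable_keepalive( int s )
{
    int enable = 1;   // turn keep-alive on
    int idle = 10;    // seconds of idle time before the first probe (TCP_KEEPIDLE)
    int interval = 5; // seconds between probes (TCP_KEEPINTVL)
    int count = 3;    // unanswered probes before the connection is dropped (TCP_KEEPCNT)

    if( setsockopt( s, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof( enable ) ) < 0 ) return -1;
    if( setsockopt( s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof( idle ) ) < 0 ) return -1;
    if( setsockopt( s, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof( interval ) ) < 0 ) return -1;
    if( setsockopt( s, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof( count ) ) < 0 ) return -1;
    return 0;
}
Once the probes go unanswered, a recv() blocked on that socket typically fails with errno set to ETIMEDOUT.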
Windows also uses the SO_KEEPALIVE socket option for enabling keep-alive probes, but for configuring the keep-alive parameters, use the SIO_KEEPALIVE_VALS ioctl.
You could use select()
From http://linux.die.net/man/2/select
int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
select() blocks until the first event (read ready, write ready, or exception) on one or more file descriptors or a timeout occurs.
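Here is a sketch of that approach, wrapping recv() so the thread wakes up periodically and can check a shutdown flag; recv_with_timeout() is a made-up helper, not a standard call:
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

// Returns recv()'s result if the socket became readable, -2 on timeout, -1 on error.
ssize_t recv_with_timeout( int s, char* buf, size_t len, int timeout_sec )
{
    fd_set rfds;
    struct timeval tv;
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    FD_ZERO( &rfds );
    FD_SET( s, &rfds );

    int r = select( s + 1, &rfds, 0, 0, &tv );
    if( r > 0 && FD_ISSET( s, &rfds ) )
        return recv( s, buf, len, 0 ); // 0 here means the peer closed the connection
    if( r == 0 )
        return -2; // timed out: the caller can check its shutdown flag and loop again
    return -1;     // select() itself failed
}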
sockopts and select are probably the ideal choices. An additional option that you should consider as a backup is to send your process a signal (for example using the alarm() call). This should force any syscall in progress to exit and set errno to EINTR.

c++ linux accept() blocking after socket closed

I have a thread that listens for new connections
new_fd = accept(Listen_fd, (struct sockaddr *) & their_addr, &sin_size);
and another thread that closes Listen_fd when it's time to close the program. After Listen_fd is closed, however, accept() still blocks. When I use GDB to try and debug, accept() doesn't block. I thought it could be a problem with SO_LINGER, but it shouldn't be on by default and shouldn't change when using GDB. Any idea what's going on, or any other suggestion for closing the listening socket?
Use: sock.shutdown (socket.SHUT_RD)
Then accept will return EINVAL. No ugly cross thread signals required!
From the Python documentation:
"Note close() releases the resource associated with a connection but does not necessarily close the connection immediately. If you want to close the connection in a timely fashion, call shutdown() before close()."
http://docs.python.org/3/library/socket.html#socket.socket.close
I ran into this problem years ago, while programming in C. But I only found the solution today, after running into the same problem in Python, AND pondering using signals (yuck!), AND THEN remembering the note about shutdown!
As for the comments that say you should not close/use sockets across threads... in CPython the global interpreter lock should protect you (assuming you are using file objects rather than raw, integer file descriptors).
Here is example code:
import socket, threading, time

sock = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt (socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind (('', 8000))
sock.listen (5)

def child ():
    print ('child accept ...')
    try: sock.accept ()
    except OSError as exc : print ('child exception %s' % exc)
    print ('child exit')

threading.Thread ( target = child ).start ()
time.sleep (1)
print ('main shutdown')
sock.shutdown (socket.SHUT_RD)
time.sleep (1)
print ('main close')
sock.close ()
time.sleep (1)
print ('main exit')
The behavior of accept when called on something which is not a valid socket FD is undefined. "Not a valid socket FD" includes numbers which were once valid sockets but have since been closed. You might say "but Borealid, it's supposed to return EINVAL!", but that's not guaranteed - for instance, the same FD number might be reassigned to a different socket between your close and accept calls.
So, even if you were to isolate and correct whatever makes your program fail, you could still begin to fail again in the future. Don't do it - correct the error that causes you to attempt to accept a connection on a closed socket.
If you meant that a call which was previously made to accept continues blocking after close, then what you should do is send a signal to the thread which is blocked in accept. This will give it EINTR and it can cleanly disengage - and then close the socket. Don't close it from a thread other than the one using it.
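A sketch of that approach follows; accept_thread and SIGUSR1 are illustrative choices, and the important detail is installing the handler without SA_RESTART so that accept() really returns with EINTR:
#include <pthread.h>
#include <signal.h>
#include <string.h>

static void wakeup_handler( int )
{
    // No-op: its only purpose is to interrupt the blocking accept() call.
}

void install_wakeup_handler()
{
    struct sigaction sa;
    memset( &sa, 0, sizeof( sa ) );
    sa.sa_handler = wakeup_handler;
    sigemptyset( &sa.sa_mask );
    sa.sa_flags = 0; // deliberately no SA_RESTART, so accept() fails with EINTR
    sigaction( SIGUSR1, &sa, 0 );
}

// From the controlling thread, once it decides to shut down:
//     pthread_kill( accept_thread, SIGUSR1 );
// accept() in the target thread then returns -1 with errno == EINTR; that thread can
// check its own shutdown flag and close Listen_fd itself.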
The shutdown() function may be what you are looking for. Calling shutdown(Listen_fd, SHUT_RDWR) will cause any blocked call to accept() to return EINVAL. Coupling a call to shutdown() with the use of an atomic flag can help to determine the reason for the EINVAL.
For example, if you have this flag:
std::atomic<bool> safe_shutdown(false);
Then you can instruct the other thread to stop listening via:
shutdown_handler([&]() {
    safe_shutdown = true;
    shutdown(Listen_fd, SHUT_RDWR);
});
For completeness, here's how your thread could call accept:
while (true) {
    sockaddr_in clientAddr = {0};
    socklen_t clientAddrSize = sizeof(clientAddr);
    int connSd = accept(Listen_fd, (sockaddr *)&clientAddr, &clientAddrSize);
    if (connSd < 0) {
        // If shutdown_handler() was called, then exit gracefully
        if (errno == EINVAL && safe_shutdown)
            break;
        // Otherwise, it's an unrecoverable error
        std::terminate();
    }

    char clientname[1024];
    std::cout << "Connected to "
              << inet_ntop(AF_INET, &clientAddr.sin_addr, clientname,
                           sizeof(clientname))
              << std::endl;

    service_connection(connSd);
}
It's a workaround, but you could select() on Listen_fd with a timeout, and if the timeout occurs, check whether you're about to close the program. If so, exit the loop; if not, go back and do the next select().
Are you checking the return value of close?
From linux manpages, (http://www.kernel.org/doc/man-pages/online/pages/man2/close.2.html)
"It is probably unwise to close file descriptors while they may be in use by system calls in other threads in the same process. Since a file descriptor may be reused, there are some obscure race conditions that may cause unintended side effects".
You can use a select() instead of a blocking accept() and wait for some event from the other thread, then close the socket in the listener thread.

SIGPIPE, Broken pipe

I am working on a networking program using epoll on a Linux machine, and I got this error message from gdb:
Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7ffff609a700 (LWP 19788)]
0x00007ffff7bcdb2d in write () from /lib/libpthread.so.0
(gdb)
(gdb) backtrace
#0 0x00007ffff7bcdb2d in write () from /lib/libpthread.so.0
#1 0x0000000000416bc8 in WorkHandler::workLoop() ()
#2 0x0000000000416920 in WorkHandler::runWorkThread(void*) ()
#3 0x00007ffff7bc6971 in start_thread () from /lib/libpthread.so.0
#4 0x00007ffff718392d in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()
My server does an n^2-time calculation, and I tried to run it with 500 connected users. What might cause this error, and how do I fix it? Here is the write loop:
while(1){
    if(remainLength >= MAX_LENGTH)
        currentSentLength = write(client->getFd(), sBuffer, MAX_LENGTH);
    else
        currentSentLength = write(client->getFd(), sBuffer, remainLength);
    if(currentSentLength == -1){
        log("WorkHandler::workLoop, connection has been lost \n");
        break;
    }
    sBuffer += currentSentLength;
    remainLength -= currentSentLength;
    if(remainLength == 0)
        break;
}
When you write to a pipe that has been closed (by the remote end), your program will receive this signal. For simple command-line filter programs, this is often an appropriate default action, since the default handler for SIGPIPE will terminate the program.
For a multithreaded program, the correct action is usually to ignore the SIGPIPE signal, so that writing to a closed socket will not terminate the program.
Note that you cannot successfully perform a check before writing, since the remote end may close the socket in between your check and your call to write().
See this question for more information on ignoring SIGPIPE: How to prevent SIGPIPEs (or handle them properly)
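Applied to the write loop from the question, that might look roughly like this (send_all() is a made-up name, and in real code the signal() call belongs at program startup rather than inside the sending code):
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

bool send_all( int fd, const char* sBuffer, size_t remainLength )
{
    signal( SIGPIPE, SIG_IGN ); // broken pipes now surface as write() errors, not signals

    while( remainLength > 0 )
    {
        ssize_t n = write( fd, sBuffer, remainLength );
        if( n == -1 )
        {
            if( errno == EPIPE )
                fprintf( stderr, "WorkHandler::workLoop, connection has been lost\n" );
            return false; // drop this client; the rest of the server keeps running
        }
        sBuffer += n;
        remainLength -= (size_t)n;
    }
    return true;
}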
You're not catching SIGPIPE signals, but you're trying to write to a pipe that's been broken/closed.
Fairly self-explanatory.
It's usually sufficient to handle SIGPIPE signals as a no-op, and handle the error case around your write call in whatever application-specific manner you require.