I am working on a networking program that uses epoll on a Linux machine, and I got the following error message from gdb:
Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7ffff609a700 (LWP 19788)]
0x00007ffff7bcdb2d in write () from /lib/libpthread.so.0
(gdb)
(gdb) backtrace
#0 0x00007ffff7bcdb2d in write () from /lib/libpthread.so.0
#1 0x0000000000416bc8 in WorkHandler::workLoop() ()
#2 0x0000000000416920 in WorkHandler::runWorkThread(void*) ()
#3 0x00007ffff7bc6971 in start_thread () from /lib/libpthread.so.0
#4 0x00007ffff718392d in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()
My server does a calculation that takes n^2 time, and I tried to run it with 500 connected users. What might cause this error, and how do I fix it?
while(1){
    if(remainLength >= MAX_LENGTH)
        currentSentLength = write(client->getFd(), sBuffer, MAX_LENGTH);
    else
        currentSentLength = write(client->getFd(), sBuffer, remainLength);
    if(currentSentLength == -1){
        log("WorkHandler::workLoop, connection has been lost \n");
        break;
    }
    sBuffer += currentSentLength;
    remainLength -= currentSentLength;
    if(remainLength == 0)
        break;
}
When you write to a pipe or socket that has been closed by the remote end, your program receives this signal. For simple command-line filter programs, the default action is often appropriate, since the default handler for SIGPIPE terminates the program.
For a multithreaded program, the correct action is usually to ignore the SIGPIPE signal, so that writing to a closed socket will not terminate the program.
Note that you cannot successfully perform a check before writing, since the remote end may close the socket in between your check and your call to write().
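As a minimal sketch of the approach described above, the following ignores SIGPIPE once at startup and then relies on the return value of write(). The names ignoreSigpipe and writeToClosedPipe are hypothetical helpers made up for this illustration; the pipe only stands in for a socket whose peer has gone away.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper: ignore SIGPIPE for the whole process.
// Intended to be called once at startup, before any threads exist.
bool ignoreSigpipe() {
    struct sigaction sa = {};
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, nullptr) == 0;
}

// Demonstration only: write one byte to a pipe whose read end is closed.
// Returns the errno reported by write(), or 0 if the write succeeded.
int writeToClosedPipe() {
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;
    close(fds[0]);                      // no reader is left
    const char byte = 'x';
    ssize_t n = write(fds[1], &byte, 1);
    int err = (n == -1) ? errno : 0;    // save errno before close() can change it
    close(fds[1]);
    return err;
}
```

With SIGPIPE ignored, the failed write reports EPIPE through errno instead of terminating the process, so the existing `currentSentLength == -1` branch is the place to drop the client. When running under gdb, the debugger may still stop on SIGPIPE even though the program ignores it; `handle SIGPIPE nostop noprint pass` at the gdb prompt should suppress that.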
See this question for more information on ignoring SIGPIPE: How to prevent SIGPIPEs (or handle them properly)
You're not catching SIGPIPE signals, but you're trying to write to a pipe that's been broken/closed.
Fairly self-explanatory.
It's usually sufficient to handle SIGPIPE signals as a no-op, and handle the error case around your write call in whatever application-specific manner you require... like this.
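An alternative on Linux, sketched here under assumptions, is to suppress the signal per call by using send() with the MSG_NOSIGNAL flag instead of write(), and to handle the error around the call. The helper names sendAll and sendToClosedPeer are hypothetical, and the loop assumes a blocking socket; a non-blocking socket driven by epoll would additionally have to handle EAGAIN.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: send all len bytes on a blocking socket.
// Returns true on success; on failure returns false with errno set
// (EPIPE or ECONNRESET when the peer has gone away).
bool sendAll(int fd, const char* buf, size_t len) {
    while (len > 0) {
        // MSG_NOSIGNAL: report a broken connection via errno, not SIGPIPE
        ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
        if (n == -1) {
            if (errno == EINTR)
                continue;               // interrupted by a signal, retry
            return false;
        }
        buf += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Demonstration only: send to a socket whose peer end is already closed.
// Returns the errno from the failed send, or 0 if it succeeded.
int sendToClosedPeer() {
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;
    close(sv[1]);                       // the "remote" end goes away
    bool ok = sendAll(sv[0], "hello", 5);
    int err = ok ? 0 : errno;
    close(sv[0]);
    return err;
}
```

This keeps the decision local to the call site, which can be convenient in a library that should not change the process-wide signal disposition.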
Related
I'm coding a C++ SSL Server for TCP Connections on Linux.
When the program uses SSL_write() to write to a closed connection, a SIGPIPE signal is raised, which causes the program to shut down. I know that this is normal behaviour, but the program should not die every time a peer fails to close the connection correctly.
I have already googled a lot and tried pretty much everything I found, but nothing seems to work for me. signal(SIGPIPE, SIG_IGN) does not work: the signal is still raised (same for signal(SIGPIPE, SomeKindOfHandler)).
The gdb output:
Program received signal SIGPIPE, Broken pipe.
0x00007ffff6b23ccd in write () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) where
#0 0x00007ffff6b23ccd in write () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007ffff7883835 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#2 0x00007ffff7881687 in BIO_write () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#3 0x00007ffff7b9d3e0 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#4 0x00007ffff7b9db04 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#5 0x000000000042266a in NetInterface::SendToSubscribers(bool) () at ../Bether/NetInterface.h:181
#6 0x0000000000425834 in main () at ../Bether/main.cpp:111
About the Code:
I'm using a thread which is waiting for new connections and accepting them. The thread then puts the connection information (BIO & SSL) into a static map inside the NetInterface class.
Every 5 seconds NetInterface::SendToSubscribers() is executed from main(). This function accesses the static map and sends data to every connection in it. This function is also where the SIGPIPE comes from.
I have used signal(SIGPIPE,SIG_IGN) in main() (obviously before the 5-seconds loop) and in NetInterface::SendToSubscribers(), but it is not working anywhere.
Thanks for your help!
You have to call sigaction to change this behavior, either to ignore SIGPIPE or to handle it with your own signal handler. Please prefer sigaction over signal: the semantics of signal vary across systems, and its man page recommends using sigaction instead.
http://man7.org/linux/man-pages/man2/sigaction.2.html
One way to do it (I haven't compiled this code but should be something like this):
#include <signal.h>

void sigpipe_handler(int signal)
{
    ...
}

int main()
{
    struct sigaction sh = {};  // zero-initialize every field
    struct sigaction osh;      // receives the previous disposition

    sh.sa_handler = &sigpipe_handler; // can be set to SIG_IGN instead
    // Restart interrupted system calls
    sh.sa_flags = SA_RESTART;
    // Block no additional signals while the handler runs
    sigemptyset(&sh.sa_mask);

    if (sigaction(SIGPIPE, &sh, &osh) < 0)
    {
        return -1;
    }
    ...
}
If the program is multithreaded, it is a little different as you have less control on which thread will receive the signal. That depends on the type of signal. For SIGPIPE, it will be sent to the pthread that generated the signal. Nevertheless, sigaction should work OK.
It is possible to set the mask in the main thread and all subsequently created pthreads will inherit the signal mask. Otherwise, the signal mask can be set in each thread.
sigset_t blockedSignal;
sigemptyset(&blockedSignal);
sigaddset(&blockedSignal, SIGPIPE);
pthread_sigmask(SIG_BLOCK, &blockedSignal, NULL);
However, if you block the signal, it stays pending and is delivered as soon as it is unblocked. To discard it, consume it with sigtimedwait before the thread unblocks the signal or finishes. Setting the disposition with sigaction, either in the main thread or in the thread that generated SIGPIPE, should work as well.
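The combination of blocking SIGPIPE and then consuming the pending signal can be sketched as follows. This is only an illustration under assumptions: writeWithSigpipeBlocked is a hypothetical name, the pipe stands in for a broken socket, and the sketch assumes SIGPIPE still has its default disposition (if it is ignored, nothing becomes pending and sigtimedwait times out).

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <time.h>
#include <unistd.h>

// Blocks SIGPIPE in the calling thread, performs a write that fails with
// EPIPE, then consumes the pending SIGPIPE before restoring the old mask.
// Returns the signal number that was consumed, 0 if the write did not fail
// with EPIPE, or -1 on error.
int writeWithSigpipeBlocked() {
    sigset_t pipeSet, oldSet;
    sigemptyset(&pipeSet);
    sigaddset(&pipeSet, SIGPIPE);
    if (pthread_sigmask(SIG_BLOCK, &pipeSet, &oldSet) != 0)
        return -1;

    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;
    close(fds[0]);                          // no reader: the write will fail
    const char byte = 'x';
    ssize_t n = write(fds[1], &byte, 1);
    int writeErr = (n == -1) ? errno : 0;
    close(fds[1]);

    int consumed = 0;
    if (writeErr == EPIPE) {
        // Zero timeout: poll for the pending signal without waiting.
        struct timespec zero = {0, 0};
        consumed = sigtimedwait(&pipeSet, nullptr, &zero);
    }

    // Safe to restore the mask now that no SIGPIPE is pending.
    pthread_sigmask(SIG_SETMASK, &oldSet, nullptr);
    return consumed;
}
```

Threads created after the pthread_sigmask call inherit the blocked mask, so a server that blocks SIGPIPE in main() before spawning workers would not need to repeat the call in each thread.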
I've found the solution, it works with pthread_sigmask.
sigset_t set;
sigemptyset(&set);
sigaddset(&set, SIGPIPE);
if (pthread_sigmask(SIG_BLOCK, &set, NULL) != 0)
    return -1;
Thanks to everyone for the help!
I'm writing an HTTP/HTTPS client using openssl-0.9.8e.
I get an error when I call SSL_read().
Then I call SSL_get_error and get SSL_ERROR_SYSCALL, and errno is ECONNRESET (104, "Connection reset by peer").
According to the SSL documentation, this is what it means:
SSL_ERROR_SYSCALL
Some I/O error occurred. The OpenSSL error queue may contain more information on the error.
If the error queue is empty (i.e. ERR_get_error() returns 0), ret can be used to find out more about the
error: If ret == 0, an EOF was observed that violates the protocol. If ret == -1, the underlying BIO
reported an I/O error (for socket I/O on Unix systems, consult errno for details).
So the connection was reset. I then call SSL_shutdown to close the connection, and get: Program received signal SIGPIPE, Broken pipe.
I call signal(SIGPIPE, SIG_IGN); to ignore the SIGPIPE signal, but it doesn't seem to work.
A segmentation fault happens:
#0 0x00000032bd00d96b in write () from /lib64/libpthread.so.0
#1 0x0000003add478367 in ?? () from /lib64/libcrypto.so.6
#2 0x0000003add4766fe in BIO_write () from /lib64/libcrypto.so.6
#3 0x0000003add8208fd in ssl3_write_pending () from /lib64/libssl.so.6
#4 0x0000003add820d9a in ssl3_dispatch_alert () from /lib64/libssl.so.6
#5 0x0000003add81e982 in ssl3_shutdown () from /lib64/libssl.so.6
#6 0x00000000004565d0 in CWsPollUrl::SSLClear (this=<value optimized out>, ctx=0x2aaab804a1b0, ssl=0x2aaab804a680)
at ../src/Wspoll.cpp:1122
#7 0x00000000004575e0 in CWsPollUrl::asyncEventDelete (this=0x4d422e50, eev=0x2aaab8001160) at ../src/Wspoll.cpp:1546
#8 0x000000000045928a in CWsPollUrl::onFail (this=0x4d422e50, eev=0x2aaab8001160, errorCode=4) at ../src/Wspoll.cpp:1523
#9 0x000000000045ab17 in CWsPollUrl::handleData (this=0x4d422e50, eev=0x2aaab8001160, len=<value optimized out>) at ../src/Wspoll.cpp:1259
#10 0x000000000045abcc in CWsPollUrl::asyncRecvEvent (this=0x4d422e50, fd=<value optimized out>, eev=0x2aaab8001160)
at ../src/Wspoll.cpp:1211
#11 0x00000000004636b5 in event_base_loop (base=0x14768360, flags=0) at event.c:1350
#12 0x0000000000456a62 in CWsPollUrl::run (this=<value optimized out>, param=<value optimized out>) at ../src/Wspoll.cpp:461
#13 0x0000000000436c5c in doPollUrl (data=<value optimized out>, user_data=<value optimized out>) at ../src/PollStrategy.cpp:151
#14 0x00000032bf44a95d in ?? () from /lib64/libglib-2.0.so.0
#15 0x00000032bf448e04 in ?? () from /lib64/libglib-2.0.so.0
#16 0x00000032bd00677d in start_thread () from /lib64/libpthread.so.0
#17 0x00000032bc4d3c1d in clone () from /lib64/libc.so.6
Why do I get the SIGPIPE signal when I have already called signal(SIGPIPE, SIG_IGN);?
Does anyone know why?
Thanks in advance
If you get an I/O error from SSL_read, it does not make much sense to call SSL_shutdown, because the shutdown tries to send a "close notify" alert to the peer, and this will obviously not work on a broken connection. That is why you get the SIGPIPE or EPIPE. Getting ECONNRESET from SSL_read in this case probably means that the peer closed the connection hard, i.e. without doing an SSL_shutdown. You should not continue working with the socket after the error, not even to call SSL_shutdown.
In addition to @SteffenUllrich's answer, you can also call SSL_get_shutdown before calling SSL_shutdown and check whether the SSL_SENT_SHUTDOWN flag is already set. You can do something like this:
//Perform a mutex lock here
if(SSL_get_shutdown(ssl) & SSL_SENT_SHUTDOWN)
{
    printf("shutdown request ignored\n");
}
else
{
    SSL_shutdown(ssl);
}
//Perform a mutex unlock here
In a multithreaded program where the SSL * pointer is shared among multiple threads, SSL_shutdown may already have been called by another thread, and this code can protect you from a SIGPIPE signal in that case.
I have written a Qt5/C++ program which forks and runs in the background, and stops in response to a signal and shuts down normally. All sounds great, but when I "ps ax | grep myprog" I see a bunch of my programs still running; eg:
29244 ? Ss 149:47 /usr/local/myprog/myprog -q
30913 ? Ss 8:37 /usr/local/myprog/myprog -q
32484 ? Ss 0:11 /usr/local/myprog/myprog -q
If I run the program in the foreground then the process does NOT hang around on the process list - it dies off as expected. This only happens when in the background. Why?
Update: I found that my program is in the futex_wait_queue_me state (queue_me and wait for wakeup, timeout, or signal). I do have 3 separate threads, and that may be related. So I attached a debugger to one of the waiting processes and found this:
(gdb) bt
#0 0x000000372460b575 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f8990fb454b in QWaitCondition::wait(QMutex*, unsigned long) ()
from /opt/Qt/5.1.1/gcc_64/lib/libQt5Core.so.5
#2 0x00007f8990fb3b3e in QThread::wait(unsigned long) () from /opt/Qt/5.1.1/gcc_64/lib/libQt5Core.so.5
#3 0x00007f8990fb0402 in QThreadPoolPrivate::reset() () from /opt/Qt/5.1.1/gcc_64/lib/libQt5Core.so.5
#4 0x00007f8990fb0561 in QThreadPool::waitForDone(int) () from /opt/Qt/5.1.1/gcc_64/lib/libQt5Core.so.5
#5 0x00007f89911a4261 in QMetaObject::activate(QObject*, int, int, void**) ()
from /opt/Qt/5.1.1/gcc_64/lib/libQt5Core.so.5
#6 0x00007f89911a4d5f in QObject::destroyed(QObject*) () from /opt/Qt/5.1.1/gcc_64/lib/libQt5Core.so.5
#7 0x00007f89911aa3ee in QObject::~QObject() () from /opt/Qt/5.1.1/gcc_64/lib/libQt5Core.so.5
#8 0x0000000000409d8b in main (argc=1, argv=0x7fffba44c8f8) at ../../src/main.cpp:27
(gdb)
Update:
I commented out my 2 threads, so only the main thread runs now, and the problem is the same.
Is there a special way to cause a background process to exit? Why won't the main thread shutdown?
Update:
Solved - Qt does not like fork. (See another StackExchange question.) I had to move my fork to the highest level (before Qt does anything), and then Qt doesn't hang on exit.
http://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE%20CODES
PROCESS STATE CODES
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
D uninterruptible sleep (usually IO)
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped, either by a job control signal or because it is being traced.
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct ("zombie") process, terminated but not reaped by its parent.
For BSD formats and when the stat keyword is used, additional characters may be displayed:
< high-priority (not nice to other users)
N low-priority (nice to other users)
L has pages locked into memory (for real-time and custom IO)
s is a session leader
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+ is in the foreground process group.
So your processes are all in "S: interruptible sleep." That is, they are all waiting for blocking syscalls.
You might have better hints on what your programs are waiting for from this command:
$ ps -o pid,stat,wchan `pidof zsh`
PID STAT WCHAN
4490 Ss rt_sigsuspend
4814 Ss rt_sigsuspend
4861 Ss rt_sigsuspend
4894 Ss+ n_tty_read
5744 Ss+ n_tty_read
...
"wchan (waiting channel)" shows a kernel function (=~ syscall) which is blocking.
See also
https://askubuntu.com/questions/19442/what-is-the-waiting-channel-of-a-process
I have a Jabber server application and another Jabber client application, both in C++.
When the client receives and sends a lot of messages (more than 20 per second), select eventually just freezes and never returns.
According to netstat, the socket is still connected on Linux, and tcpdump shows that messages are still being sent to the client, but select never returns.
Here is the code that calls select:
bool ConnectionTCPBase::dataAvailable( int timeout )
{
    if( m_socket < 0 )
        return true; // let recv() catch the closed fd
    fd_set fds;
    struct timeval tv;
    FD_ZERO( &fds );
    // the following causes a C4127 warning in VC++ Express 2008 and possibly other versions.
    // however, the reason for the warning can't be fixed in gloox.
    FD_SET( m_socket, &fds );
    tv.tv_sec = timeout / 1000000;
    tv.tv_usec = timeout % 1000000;
    return ( ( select( m_socket + 1, &fds, 0, 0, timeout == -1 ? 0 : &tv ) > 0 )
             && FD_ISSET( m_socket, &fds ) != 0 );
}
And here is the hang as seen in gdb:
Thread 2 (Thread 0x7fe226ac2700 (LWP 10774)):
#0 0x00007fe224711ff3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000000004706a9 in gloox::ConnectionTCPBase::dataAvailable (this=0xcaeb60, timeout=<value optimized out>) at connectiontcpbase.cpp:103
#2 0x000000000046c4cb in gloox::ConnectionTCPClient::recv (this=0xcaeb60, timeout=10) at connectiontcpclient.cpp:131
#3 0x0000000000471476 in gloox::ConnectionTLS::recv (this=0xd1a950, timeout=648813712) at connectiontls.cpp:89
#4 0x00000000004324cc in glooxd::C2S::recv (this=0xc5d120, timeout=10) at c2s.cpp:124
#5 0x0000000000435ced in glooxd::C2S::run (this=0xc5d120) at c2s.cpp:75
#6 0x000000000042d789 in CNetwork::run (this=0xc56df0) at src/Network.cpp:343
#7 0x000000000043115f in threading::ThreadManager::threadWorker (data=0xc56e10) at src/ThreadManager.cpp:15
#8 0x00007fe2249bc9ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#9 0x00007fe22471970d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
Do you know what can cause select to stop reporting incoming messages even though we are still sending to the client?
Is there any buffer limit in Linux when receiving and sending a lot of messages through a socket?
Thanks
There are several possibilities.
Exceeding FD_SETSIZE
Your code checks for a negative file descriptor, but not for one that reaches the upper limit, FD_SETSIZE (typically 1024). Whenever that happens, your code is
- corrupting its own stack
- presenting an empty fd_set to select, which will cause a hang
Supposing that you do not need so many concurrently open file descriptors, the solution would probably consist of finding and removing a file descriptor leak, especially in the code up the stack that handles closing of abandoned descriptors.
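A guard against the FD_SETSIZE problem could look roughly like the sketch below. It is not gloox code: readableWithSelect and demoSelectOnPipe are hypothetical names, and returning false for an out-of-range descriptor is just one possible policy (a real caller would probably want to treat it as an error rather than as "no data").

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical variant of dataAvailable(): timeoutUs is in microseconds,
// and -1 means "block until the descriptor becomes readable".
bool readableWithSelect(int fd, long timeoutUs) {
    assert(timeoutUs >= -1);
    // FD_SET on a descriptor >= FD_SETSIZE writes outside the fd_set.
    if (fd < 0 || fd >= FD_SETSIZE)
        return false;

    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(fd, &fds);

    struct timeval tv = {};
    struct timeval* tvPtr = nullptr;    // nullptr makes select() block
    if (timeoutUs != -1) {
        tv.tv_sec = timeoutUs / 1000000;
        tv.tv_usec = timeoutUs % 1000000;
        tvPtr = &tv;
    }
    int rc = select(fd + 1, &fds, nullptr, nullptr, tvPtr);
    return rc > 0 && FD_ISSET(fd, &fds);
}

// Demonstration only: a pipe becomes readable once a byte is written.
bool demoSelectOnPipe() {
    int fds[2];
    if (pipe(fds) != 0)
        return false;
    bool before = readableWithSelect(fds[0], 0);
    bool wrote = write(fds[1], "x", 1) == 1;
    bool after = readableWithSelect(fds[0], 1000);
    close(fds[0]);
    close(fds[1]);
    return !before && wrote && after;
}
```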
There is a suspicious comment in your code that indicates a possible leak:
// let recv() catch the closed fd
If this comment means that somebody sets m_socket to -1 and hopes that a recv will catch the closed socket and close it, who knows, maybe we are closing -1 and not the real closed socket. (Note the difference between closing on network level and closing on file descriptor level which requires a separate close call.)
This could also be treated by moving to poll but there are a few other limits imposed by the operating system that make this route quite challenging.
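For comparison, a poll-based readiness check might look like the following sketch. poll() takes an array of descriptors rather than a fixed-size bitmap, so the FD_SETSIZE limit does not apply. The names readableWithPoll and demoPollOnPipe are hypothetical, and note that the timeout here is in milliseconds, unlike the microseconds used in the question's code.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: true if fd is readable (or in an error/hang-up
// state that a following recv() would report) within timeoutMs.
// A negative timeoutMs blocks indefinitely.
bool readableWithPoll(int fd, int timeoutMs) {
    assert(fd >= 0);                    // poll() silently ignores negative fds
    struct pollfd pfd = {};
    pfd.fd = fd;
    pfd.events = POLLIN;
    int rc = poll(&pfd, 1, timeoutMs);
    return rc > 0 && (pfd.revents & (POLLIN | POLLHUP | POLLERR)) != 0;
}

// Demonstration only: a pipe becomes readable once a byte is written.
bool demoPollOnPipe() {
    int fds[2];
    if (pipe(fds) != 0)
        return false;
    bool before = readableWithPoll(fds[0], 0);
    bool wrote = write(fds[1], "x", 1) == 1;
    bool after = readableWithPoll(fds[0], 1000);
    close(fds[0]);
    close(fds[1]);
    return !before && wrote && after;
}
```

Moving to poll only removes the fd_set size limit; the per-process limit on open descriptors still applies, so a descriptor leak would still have to be fixed.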
Out of band data
You say that the server is "sending" data. If that means the data is sent with the send call (as opposed to write), use strace to determine the flags argument passed to send. If the MSG_OOB flag is used, the data arrives as out-of-band data, and your select call will not notice it unless you also pass a descriptor set as the exceptfds parameter, for example a copy of fds:
fd_set fds_copy = fds;
select( m_socket + 1, &fds, 0, &fds_copy, timeout == -1 ? 0 : &tv );
Process starvation
If the box is heavily overloaded, the server is executing without any blocking calls, and with a real time priority (use top to check on that) - and the client is not - the client might be starved.
Suspended process
The client might theoretically be stopped with a SIGSTOP. You would probably know if this is the case, having pressed Ctrl-Z somewhere or having some other process exercising control over the client beyond you starting it yourself.
I have a simple use case for a Thrift server (TSimpleServer) in which a couple of threads are spawned besides the main thread. One of the newly spawned threads enters the Thrift event loop (i.e. server.serve()). Upon receiving a signal in the main thread, I invoke server.stop(), which causes the error posted below.
At first I thought it was an uncaught exception. However wrapping both the invocations of server.serve() and server.stop() in try-catch'es didn't help isolate the problem. Any thoughts/suggestions(on what I should be doing)? Most Thrift tutorials/guides/examples seem to talk about server start but don't seem to mention the stop scenario, any pointers/best-practices/suggestions in this regard would be great. Thanks.
Also, I am using thrift-0.7.0.
Error details:
Thrift: Fri Nov 18 21:22:47 2011 TServerTransport died on accept: TTransportException: Interrupted
*** glibc detected *** ./build/mc_daemon: munmap_chunk(): invalid pointer: 0x0000000000695f18 ***
Segmentation fault (core dumped)
Also here's the stack-trace:
#0 0x00007fb751c92f08 in ?? () from /lib/libc.so.6
#1 0x00007fb7524bb0eb in apache::thrift::server::TSimpleServer::serve (
this=0x1e5bca0) at src/server/TSimpleServer.cpp:140
#2 0x000000000046ce15 in a::b::server_thread::operator() (
this=0x695f18)
at /path/to/server_thread.cpp:80
#3 0x000000000046c1a9 in boost::detail::thread_data<boost::reference_wrapper<ads::data_load::server_thread> >::run (this=0x1e5bd80)
at /usr/include/boost/thread/detail/thread.hpp:81
#4 0x00007fb7526f2b70 in thread_proxy ()
from /usr/lib/libboost_thread.so.1.40.0
#5 0x00007fb7516fd9ca in start_thread () from /lib/libpthread.so.0
#6 0x00007fb7519fa70d in clone () from /lib/libc.so.6
#7 0x0000000000000000 in ?? ()
Edit 1: I have added pseudo-code for the main thread, the thrift server thread and the background thread.
Edit 2: I seem to have resolved the original issue as noted in my answer below. However this solution leads to two rather undesirable/questionable design choices: (i) I had to introduce a thrift endpoint to enable a mechanism to stop the server (ii) The handler class for the thrift service(which is usually required to instantiate a server object) now requires a means to signal back to the server to stop, introducing a circular dependency of sorts.
Any suggestions on these design issues/choices would be greatly appreciated.
My problem seems to have stemmed from my code/design wherein I had signal-handler code in the main thread invoking stop on the server which was started in a 'server thread'. Changing this behavior(as noted in the pastebin code-snippets) helped resolve this issue.
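One general pattern for keeping stop logic out of a signal handler is to block the signal in every thread and have the main thread wait for it synchronously with sigwait, then call stop() from ordinary code. The sketch below only illustrates that pattern and is not necessarily the change made here: LoopServer is a hypothetical stand-in with a serve()/stop() pair, not Thrift's API, and whether TSimpleServer::stop() in thrift-0.7.0 can safely be called from another thread is an assumption that would need checking.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <signal.h>
#include <thread>

// Hypothetical stand-in for a server with a blocking serve() and a stop().
class LoopServer {
public:
    void serve() {
        while (!stopRequested_.load())
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    void stop() { stopRequested_.store(true); }
private:
    std::atomic<bool> stopRequested_{false};
};

// Blocks stopSignal, runs the server in a worker thread, waits for the
// signal with sigwait(), then stops the server from normal (non-handler)
// code. Returns the signal number that was received.
int runUntilSignal(LoopServer& server, int stopSignal) {
    sigset_t set, oldSet;
    sigemptyset(&set);
    sigaddset(&set, stopSignal);
    // Block before creating the thread so the worker inherits the mask.
    pthread_sigmask(SIG_BLOCK, &set, &oldSet);

    std::thread worker([&server] { server.serve(); });

    int received = 0;
    int rc = sigwait(&set, &received);  // synchronous, no handler involved
    assert(rc == 0);
    (void)rc;

    server.stop();
    worker.join();
    pthread_sigmask(SIG_SETMASK, &oldSet, nullptr);
    return received;
}
```

Because no handler runs asynchronously, stop() is not restricted to async-signal-safe operations, and the service handler class does not need a back-reference to the server.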