As the question states, is it useful to always collect a software-based backtrace (like using libc backtrace: http://www.gnu.org/software/libc/manual/html_node/Backtraces.html) in all error functions and signal handlers?
Would it not be very helpful for debugging a wide variety of bugs (memory corruption, concurrency issues, etc.)? I assume it would not hurt normal performance either, since it would only be triggered on error paths.
As the question states, is it useful to always collect a software-based backtrace
Yes, it is generally very useful to have a crash stack trace when:
your code runs in your own environment, and you are not worried about the stack trace revealing any secrets; and
the crash handler does not further corrupt the core dump, does not hang, etc.
like using libc backtrace
glibc backtrace calls calloc under certain conditions, and is not safe in a crash handler. It can cause both the hang and the further corruption mentioned above. Writing a crash handler that reliably prints a stack trace in an async-signal-safe manner is quite non-trivial.
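If you do attempt it anyway, the usual mitigation is roughly the following sketch (assuming glibc on Linux): call backtrace once at startup so its one-time lazy initialization, which may allocate, happens outside the handler, then restrict the handler itself to backtrace and backtrace_symbols_fd, which write directly to a file descriptor without calling malloc. Even this is a best-effort hack, not a guarantee.

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void crash_handler(int) {
    void *frames[64];
    // backtrace_symbols_fd writes straight to the fd and does not malloc,
    // unlike backtrace_symbols. This is still not guaranteed to be
    // async-signal-safe; it is merely less likely to deadlock in practice.
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(1);   // async-signal-safe, unlike exit()
}

int main() {
    // Prime backtrace outside the handler, so its one-time lazy
    // initialization (which may allocate) happens here, not in the handler.
    void *dummy[1];
    backtrace(dummy, 1);
    signal(SIGSEGV, crash_handler);
    // ... rest of the program ...
}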
why then do error functions in "standard" applications not call backtrace?
Consider cat /no/such/file. Currently it produces:
cat: /no/such/file: No such file or directory
which is all you really need to know. Making this print anything else is useless. If you had many such files, and cat printed a full stack trace for each, you'd get many pages of error output, and that would only make finding the real problem harder.
For fatal signal handlers (e.g. SIGSEGV) the answer is that most "standard" applications don't actually handle such signals, and simply use the default action, which produces a core dump.
But if they did catch the signal, calling backtrace, backtrace_symbols, or backtrace_symbols_fd from the signal handler would be equally unsafe, and could deadlock, which is much worse than simply dumping core. Consider what happens if you have a long-running script with 1000 commands in it. You start it, and a week later discover that it didn't make any progress because the second command crashed and deadlocked trying to print the crash stack trace.
Related
I'm wondering if there is a way to catch all segmentation faults / core dumps, and print the call stack? Catching all signals looks like a doable approach, but I'm not entirely sure how it works; in my experience it did not always give the outcome I wanted, and sometimes it simply failed to catch the core dump, so maybe I did something wrong.
The reason I ask is that I usually debug a very complicated system in which many segfaults are hard to reproduce, and it is not practical to step through it with gdb line by line. So if I could catch every segfault and print a call stack or other information, that would be a great help in debugging.
I'm wondering if there is a way to catch all segmentation faults ...
Certainly. This can be achieved by registering a signal handler using std::signal.
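A minimal registration sketch; note that almost nothing is legal inside the handler, so this one sticks to the async-signal-safe write(2) and _exit(2):

#include <csignal>
#include <unistd.h>

extern "C" void segv_handler(int) {
    // Only async-signal-safe calls in here: write(2) and _exit(2) qualify.
    const char msg[] = "caught SIGSEGV\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

int main() {
    std::signal(SIGSEGV, segv_handler);
    int *p = nullptr;
    *p = 42;   // deliberately fault, just to demonstrate the handler
}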
... and print the call stack?
This is quite a lot trickier. There is no standard way to inspect the call stack in C++. Linux has several ways, but unfortunately the most conventional one, the glibc backtrace function, is not async-signal-safe, so there is no guarantee that it will work in a signal handler.
A simpler approach is not to do this in a signal handler at all, but instead to let Linux generate a core dump (make sure core dumps are enabled, e.g. with ulimit -c unlimited, and check /proc/sys/kernel/core_pattern for where they go). You'll get much more information out of that.
... core dumps, and print the call stack?
Certainly. You can use a debugger:
gdb /path/to/executable /path/to/corefile
(gdb) bt
and not practical to run it with gdb line by line.
Then run it in gdb, but not line by line: simply run the program inside gdb until it segfaults, then print the backtrace with bt.
I use a third-party library in my C++ program which, under certain circumstances, raises the SIGABRT signal. I know that trying to free an uninitialized pointer or something similar can be the cause of this signal. Nevertheless, I want my program to keep running after this signal is raised, show a message, and let the user change the settings, in order to cope with this signal.
(I use Qt for development.)
How can I do that?
I use a third-party library in my C++ program which under certain circumstances raises the SIGABRT signal
If you have the source code of that library, you need to correct the bug (and the bug could be in your code).
BTW, SIGABRT probably happens because abort(3) gets called indirectly (perhaps because you violated some convention or invariant of that library, which might use assert(3) and thereby call abort). I would guess that in Caffe the various CHECK* macros can indirectly call abort; I leave you to investigate that.
If you don't have the source code or don't have the capacity or time to fix that bug in that third party library, you should give up using that library and use something else.
In many cases, you should trust external libraries more than your own code. Probably, you are abusing or misusing that library. Read carefully its documentation and be sure that your own code calling it is using that library correctly and respects its invariants and conventions. Probably the bug is in your own code, at some other place.
I want to keep running my program
This is impossible (or very unreliable, so unreasonable). I guess that your program has some undefined behavior. Be very scared, and work hard to avoid UB.
You need to improve your debugging skills. Learn better how to use the gdb debugger, valgrind, GCC sanitizers (e.g. instrumentation options like -fsanitize=address, -fsanitize=undefined and others), etc...
You reasonably should not try to handle SIGABRT, even if in principle you could (but then read carefully signal(7), signal-safety(7), and hints about handling Unix signals in Qt). I strongly recommend against even trying to catch SIGABRT.
Unfortunately, you can't.
The SIGABRT signal is raised by abort() itself.
Ref: https://stackoverflow.com/a/3413215/9332965
You can handle SIGABRT, but you probably shouldn't.
The "can" is straightforward - just trap it in the usual way, using signal(). You don't want to return from this signal handler - you probably got here from abort(), possibly originally from assert(), and that function will exit after raising the signal. You could, however, longjmp() back to a state you set up earlier.
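Here is a rough sketch of that escape hatch; call_into_flaky_library is a hypothetical stand-in for the third-party call, and sigsetjmp/siglongjmp are used rather than plain setjmp/longjmp so that the signal mask is restored as well:

#include <setjmp.h>   // sigsetjmp/siglongjmp are POSIX, not standard C++
#include <csignal>
#include <cstdio>
#include <cstdlib>

static sigjmp_buf recovery_point;

extern "C" void abrt_handler(int) {
    // Never return from here; jump back to the checkpoint instead.
    siglongjmp(recovery_point, 1);
}

// Hypothetical stand-in for the misbehaving third-party call.
void call_into_flaky_library() { std::abort(); }

int main() {
    std::signal(SIGABRT, abrt_handler);
    if (sigsetjmp(recovery_point, 1) == 0) {
        call_into_flaky_library();
    } else {
        // Reached via siglongjmp; program state is suspect from here on.
        std::fputs("library aborted - show a settings dialog here\n", stderr);
    }
}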
The "shouldn't" is because once SIGABRT has been raised, your data structures (including those of Qt and any other libraries) are likely in an inconsistent state and actually using any of your program's state is likely to be unpredictable at best. Apart from exiting immediately, there's not much you can do other than exec() a replacement program to take over in a sane initial state.
If you just want to show a friendly message, then you perhaps could exec() a small program to do that (or just use xmessage), but beware of exiting this with a success status where you would have had an indication of the SIGABRT otherwise.
Unfortunately there isn't much you can do to prevent SIGABRT from terminating your program - not without modifying the code that raises it, which was hopefully written by you.
You would either need to change the code so that it does not abort, or spawn a new process that runs the code instead of the current process. I do not suggest using a child process to solve this problem; the abort is most likely caused by misuse of an API or of system resources, such as low memory.
I have a program that uses services from others. If the program crashes, what is the best way to close those services? On the server side, I could define checkers that periodically monitor whether a client has become invalid. But can we do anything on the client side? I am not sure whether normal RAII still works in this case. My code is written in C and C++.
If your application experiences a hard crash, then no, your carefully crafted cleanup code will not run, whether it is part of an RAII paradigm or a method you call at the end of main. None of an application's cleanup code runs after a crash that causes the application to be terminated.
Of course, this is not true for exceptions. Although those might eventually cause the application to be terminated, they still trigger that termination in a controlled way. If the exception is caught somewhere up the stack, your RAII-based cleanup code runs during unwinding. For a completely unhandled exception, the runtime calls std::terminate, and whether destructors run first is implementation-defined - and if your cleanup code itself throws along the way, you're back to being unceremoniously ripped out of memory.
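A toy illustration of the difference: the destructor runs when a caught exception unwinds the stack, but not when the process is torn down abruptly:

#include <cstdio>
#include <exception>
#include <stdexcept>

struct Resource {
    ~Resource() { std::puts("cleanup ran"); }
};

int main() {
    try {
        Resource r;
        throw std::runtime_error("controlled failure");  // r is destroyed during unwinding
    } catch (const std::exception &) {
        std::puts("handled");
    }
    Resource r2;
    std::terminate();  // abrupt termination: r2's destructor never runs
}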
But even if your application's cleanup code can't run, the operating system will still attempt to clean up after you. This solves the problem of unreleased memory, handles, and other system objects. In general, if you crash, you need not worry about releasing these things. Your application's state is inconsistent, so trying to execute a bunch of cleanup code will just lead to unpredictable and potentially erroneous behavior, not to mention wasting a bunch of time. Just crash and let the system deal with your mess. As Raymond Chen puts it:
The building is being demolished. Don't bother sweeping the floor and emptying the trash cans and erasing the whiteboards. And don't line up at the exit to the building so everybody can move their in/out magnet to out. All you're doing is making the demolition team wait for you to finish these pointless housecleaning tasks.
Do what you must; skip everything else.
The only problem with this approach is, as you suggest in this question, when you're managing resources that are not controlled by the operating system, such as a remote resource on another system. In that case, there is very little you can do. The best scenario is to make your application as robust as possible so that it doesn't crash, but even that is not a perfect solution. Consider what happens when the power is lost, e.g. because a user's cat pulled the cord from the wall. No cleanup code could possibly run then, so even if your application never crashes, there may be termination events that are outside of your control. Therefore, your external resources must be robust in the event of failure. Time-outs are a standard method, and a much better solution than polling.
Another possible solution, depending on the particular use case, is to run consistency-checking and cleanup code at application initialization. This might be something that you would do for a service that is intended to run continuously and will be restarted promptly after termination. The next time it restarts, it checks its data and/or external resources for consistency, releases and/or re-initializes them as necessary, and then continues on as normal. Obviously this is a bad solution for a typical application, because there is no guarantee that the user will relaunch it in a timely manner.
As the other answers make clear, hoping to clean up after an uncontrolled crash (i.e., a failure which doesn't trigger the C++ exception unwind mechanism) is probably a path to nowhere. Even if you cover some cases, there will be other cases that fail and you are building in a serious vulnerability to those cases.
You mention that the source of the crashes is that you are "us[ing] services from others". I take this to mean that you are running untrusted code in-process, which is the potential source of crashes. In this case, you might consider running the untrusted code out of process and communicating back to your main process through a pipe or shared memory or whatever. Then you isolate the crashes to this child process, and can do controlled cleanup in your main process. A separate process is really the lightest-weight thing you can do that gives you the strong isolation you need to avoid corruption in the calling code.
If forking a process per-call is performance-prohibitive, you can try to keep the child process alive for multiple calls.
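A POSIX sketch of the isolation idea, where run_untrusted is a hypothetical stand-in for the third-party code:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the crash-prone third-party work.
void run_untrusted() { /* ... */ }

int main() {
    pid_t pid = fork();
    if (pid == 0) {
        run_untrusted();        // a crash here only takes down the child
        _exit(EXIT_SUCCESS);
    }
    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        std::fprintf(stderr, "worker died with signal %d\n", WTERMSIG(status));
    // Controlled cleanup of the parent's resources can happen here.
}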
One approach would be for your program to have two modes, normal operation and monitoring (a sketch follows the list below).
When started in the usual way, it would:
Act as a background monitor.
Launch a subprocess of itself, passing it an internal argument (something that wouldn't clash with normal arguments passed to it, if any).
When the subprocess exits, it would release any resources held at the server.
When started with the internal argument, it would:
Expose the user interface and "act normally", using the resources of the server.
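A rough sketch of that layout, where --worker is an invented internal argument:

#include <sys/wait.h>
#include <unistd.h>
#include <cstring>
#include <cstdlib>

int main(int argc, char **argv) {
    if (argc > 1 && std::strcmp(argv[1], "--worker") == 0) {
        // Internal mode: expose the UI and use the server's resources.
        // run_application();           // hypothetical
        return EXIT_SUCCESS;
    }
    // Monitor mode: relaunch ourselves as the worker and wait for it.
    pid_t pid = fork();
    if (pid == 0) {
        execl(argv[0], argv[0], "--worker", (char *)nullptr);
        _exit(EXIT_FAILURE);            // only reached if execl fails
    }
    int status = 0;
    waitpid(pid, &status, 0);
    // The worker is gone, cleanly or not: release server resources here.
    // release_server_resources();     // hypothetical
}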
You might look into atexit, which may give you the functionality you need to release resources upon program termination. I don't believe it is infallible, though.
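For instance, a minimal sketch (keeping in mind that atexit handlers run on normal termination via exit or a return from main, not after a crash or fatal signal):

#include <cstdio>
#include <cstdlib>

void release_remote_resources() {
    std::puts("releasing server-side resources");   // placeholder cleanup
}

int main() {
    std::atexit(release_remote_resources);
    // ... normal program logic ...
}   // the handler runs on normal exit, but NOT if the process crashes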
Having said that, however, you should really focus on making sure your program doesn't crash; even if you're hitting an error that is "unrecoverable", you should still invest in some error-handling code. If the error is caused by a seg-fault or some other similar OS-level error, you can either enable SEH exceptions (a Windows-specific mechanism) so that you can catch them with a normal try-catch block, or write some signal handlers to intercept those errors and deal with them.
I have an application with main thread and additional (detached) process created in it.
In that process we run a network server which sends logs from a queue over the network.
The question is: is it possible to do something in a segfault handler to wait for that log queue to finish sending? I want close to 100% delivery of that queue.
While it is possible to write a segfault handler, I highly recommend against it. First off, it's very easy to get your program into a "won't terminate" state due to a segfault in the segfault handler.
Second, as dan3 mentions, the memory of the process is likely in a corrupt state, making it hard to know what will and won't work.
Finally, you lose the opportunity to use the core dump from the process to help track down the problem.
While it's not recommended, it is possible.
My recommendation is to write a small program that avoids memory allocation and the use of pointers as much as possible. Perhaps create buffers as global arrays and only ever access them with limited code that can be reviewed by several skilled developers and tested thoroughly (stress testing is great here). Keep in mind, though, that the message could still get lost by the sender or receiver if they crash, so it may not be worth the effort.
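A sketch of that style: everything the handler touches is pre-allocated, and it calls only the async-signal-safe write(2) and _exit(2); g_log_fd here merely stands in for your network socket:

#include <csignal>
#include <unistd.h>

// Everything the handler needs is pre-allocated: no heap, no stdio.
static const char g_msg[] = "crash: flushing log queue\n";
static int g_log_fd = STDERR_FILENO;   // stand-in for your network socket

extern "C" void segv_handler(int) {
    write(g_log_fd, g_msg, sizeof(g_msg) - 1);   // write(2) is async-signal-safe
    _exit(1);                                    // and so is _exit(2)
}

int main() {
    std::signal(SIGSEGV, segv_handler);
    // ... application ...
}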
By the way - when Netscape first wrote a version of their browser for Linux, I ran it and it kept getting into a locked-up state. Using the strace program, I quickly found that it was in an infinite segfault loop. Very frustrating, and it wasted almost 100% of the CPU.
You can wait() for a process and pthread_join() for a thread to finish (you didn't specify clearly which one you use).
Remember that if you are in a segfault handler, your memory is messed up (avoid malloc() and free()) and your FILE * streams could also be borked.
I'd like to apologize in advance, because this is not a very good question.
I have a server application that runs as a service on a dedicated Windows server. Very randomly, this application crashes and leaves no hint as to what caused the crash.
When it crashes, the event logs have an entry stating that the application failed, but gives no clue as to why. It also gives some information on the faulting module, but it doesn't seem very reliable, as the faulting module is usually different on each crash. For example, the latest said it was ntdll, the one before that said it was libmysql, the one before that said it was netsomething, and so on.
Every single thread in the application is wrapped in a try/catch (...) (catching anything thrown and not specifically caught), __try/__except (structured exceptions), and try/catch (specific C++ exceptions). The application is compiled with /EHa, so the catch-all will also catch structured exceptions.
All of these exception handlers do the same thing. First, a crash dump is created. Second, an entry is logged to a new file on disk. Third, an entry is logged in the application logs. In the case of these crashes, none of this happens. The bottom-most exception handler (the try/catch (...)) does nothing; it just terminates the thread. The main application thread is asleep and has no chance of throwing an exception.
The application log files just stop logging. Shortly after, the process that monitors the server notices that it's no longer responding, sends an alert, and starts it again. If the server monitor notices that the server is still running, but just not responding, it takes a dump of the process and reports this, but this isn't happening.
The only other reason for this behavior that I can come up with, aside from uncaught exceptions, is a call to exit or similar. Searching the code brings up no calls to any functions that could terminate the process. I've also made sure that the program isn't terminating normally (i.e. a stop request from the service manager).
We have tried running it with windbg attached (no chance to use Visual Studio, the overhead is too high), but it didn't report anything when the crash occurred.
What can cause an application to crash like this? We're beginning to run out of options and consider that it might be a hardware failure, but that seems a bit unlikely to me.
If your app is evaporating and not generating a dump file, then it is likely that an exception is being generated which your app doesn't (or can't) handle. This could happen in two instances:
1) A top-level exception is generated and there is no matching catch block for that exception type.
2) You have a matching catch block (such as catch(...)), but you generate another exception within that handler. When this happens, Windows will rip the bones from your program. Your app will simply cease to exist. No dump will be generated, and virtually nothing will be logged. This is Windows' last-ditch effort to keep a rogue program from taking down the entire system.
A note about catch(...). This is patently Evil. There should (almost) never be a catch(...) in production code. People who write catch(...) generally argue one of two things:
"My program should never crash. If anything happens, I want to recover from the exception and continue running. This is a server application! ZOMG!"
-or-
"My program might crash, but if it does I want to create a dump file on the way down."
The former is a naive and dangerous attitude because if you do try to handle and recover from every single exception, you are going to do something bad to your operating footprint. Maybe you'll munch the heap, keep resources open that should be closed, create deadlocks or race conditions, who knows. Your program will suffer from a fatal crash eventually. But by that time the call stack will bear no resemblance to what caused the actual problem, and no dump file will ever help you.
The latter is a noble & robust approach, but the implementation of it is much more difficult than it might seem, and it is fraught with peril. The problem is you have to avoid generating any further exceptions in your exception handler, and your machine is already in a very wobbly state. Operations which are normally perfectly safe are suddenly hand grenades. new, delete, any CRT functions, string formatting, even stack-based allocations as simple as char buf[256] could make your application go >POOF< and be gone. You have to assume the stack and the heap both lie in ruins. No allocation is safe.
Moreover, there are exceptions that can occur that a catch block simply can't catch, such as SEH exceptions. For that reason, I always write an unhandled-exception handler, and register it with Windows, via SetUnhandledExceptionFilter. Within my exception handler, I allocate every single byte I need via static allocation, before the program even starts up. The best (most robust) thing to do within this handler is to trigger a separate application to start up, which will generate a MiniDump file from outside of your application. However, you can generate the MiniDump from within the handler itself if you are extremely careful not to call any CRT function directly or indirectly. Basically, if it isn't an API function you're calling, it probably isn't safe.
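A skeleton of that registration, assuming DbgHelp is available; the dump path is a placeholder, and as noted above, spawning a separate dumper process is the more robust variant:

#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

// All storage the filter needs is static, set up before anything can crash.
static wchar_t g_dump_path[MAX_PATH] = L"C:\\dumps\\app.dmp";   // placeholder

LONG WINAPI crash_filter(EXCEPTION_POINTERS *info) {
    HANDLE file = CreateFileW(g_dump_path, GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION mei = { GetCurrentThreadId(), info, FALSE };
        // Win32 API calls only; no CRT, no heap allocation on this path.
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &mei, nullptr, nullptr);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;   // then let the process die
}

int main() {
    SetUnhandledExceptionFilter(crash_filter);
    // ... application ...
}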
I've seen crashes like these happen as a result of memory corruption. Have you run your app under a memory debugger like Purify to see if that sheds some light on potential problem areas?
This isn't a very good answer, but hopefully it might help you.
I ran into those symptoms once, and after spending some painful hours chasing the cause, I found out a funny thing about Windows (from MSDN):
Dereferencing potentially invalid pointers can disable stack expansion in other threads. A thread exhausting its stack, when stack expansion has been disabled, results in the immediate termination of the parent process, with no pop-up error window or diagnostic information.
As it turns out, due to some mis-designed data sharing between threads, one of my threads would end up dereferencing more or less random pointers - and of course it hit the area just around the stack top sometimes. Tracking down those pointers was heaps of fun.
There's some technical background in Raymond Chen's "IsBadXxxPtr should really be called CrashProgramRandomly".
Late response, but maybe it helps someone: every Windows app has a limit on how many handles it can have open at any one time. We had a service that failed to release a handle in some situation; the service would just disappear after a few days, or at times weeks, depending on its usage.
Finding the leak was great fun :D (use Task Manager to watch the thread count, handle count, GDI objects, etc.)