Detecting application hang - c++

I have a very large, complex (million+ LOC) Windows application written in C++. We receive a handful of reports every day that the application has locked up, and must be forcefully shut down.
While we have extensive reporting about crashes in place, I would like to expand this to include these hang scenarios -- even with heavy logging in place, we have not been able to track down root causes for some of these. We can clearly see where activity stopped - but not why it stopped, even in evaluating output of all threads.
The problem is detecting when a hang occurs. So far, the best I can come up with is a watchdog thread (as we have evidence that background threads are continuing to run w/out issues) which periodically pings the main window with a custom message, and confirms that it is handled in a timely fashion. This would only capture GUI thread hangs, but this does seem to be where the majority of them are occurring. If a reply was not received within a configurable time frame, we would capture a memory and stack dump, and give the user the option of continuing to wait or restarting the app.
Does anyone know of a better way to do this than periodically polling the main window? It seems painfully clumsy, but I have not seen alternatives that will work on our platforms -- Windows XP, and Windows 2003 Server. I see that Vista has much better tools for this, but unfortunately that won't help us.
Suffice it to say that we have done extensive diagnostics on this and have been met with only limited success. Note that attaching windbg in real-time is not an option, as we don't get the reports until hours or days after the incident. We would be able to retrieve a memory dump and log files, but nothing more.
Any suggestions beyond what I'm planning above would be appreciated.

The answer is simple: SendMessageTimeout!
Using this API you can send a message to a window and wait up to a timeout before continuing: if the window responds before the timeout expires, the application is still running; otherwise it is hung.
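To make the ping-and-wait idea concrete, here is a minimal portable sketch. It deliberately does not use the Windows API: an atomic counter stands in for the custom window message, and the names Heartbeat and is_hung are made up for illustration. On Windows the GUI thread would call beat() from its handler for the ping message, or the watchdog would call SendMessageTimeout directly.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: the monitored thread calls beat() every time it gets
// around to servicing the "ping" (e.g. from its message handler).
class Heartbeat {
public:
    void beat() { count_.fetch_add(1, std::memory_order_relaxed); }
    unsigned long long value() const {
        return count_.load(std::memory_order_relaxed);
    }
private:
    std::atomic<unsigned long long> count_{0};
};

// Returns true if no beat was observed within `timeout`.
// Call this from the watchdog thread, never from the monitored thread.
inline bool is_hung(const Heartbeat& hb, std::chrono::milliseconds timeout) {
    assert(timeout.count() > 0);
    const auto before = hb.value();
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (hb.value() != before) return false;  // the thread is still pumping
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
    return hb.value() == before;
}
```

When is_hung returns true, the watchdog is the place to capture the dump and ask the user whether to keep waiting. The timeout should be generous enough that a legitimately long GUI operation is not reported as a hang.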

One option is to run your program under your own "debugger" all the time. Some programs, such as GetRight, do this for copy protection, but you can also do it to detect hangs. Essentially, you include in your program some code to attach to a process via the debugging API and then use that API to periodically check for hangs. When the program first starts, it checks if there's a debugger attached to it and, if not, it runs another copy of itself and attaches to it - so the first instance does nothing but act as the debugger and the second instance is the "real" one.
How you actually check for hangs is another whole question, but having access to the debugging API there should be some way to check reasonably efficiently whether the stack has changed or not (ie. without loading all the symbols). Still, you might only need to do this every few minutes or so, so even if it's not efficient it might be OK.
It's a somewhat extreme solution, but should be effective. It would also be quite easy to turn this behaviour on and off - a command-line switch will do or a #define if you prefer. I'm sure there's some code out there that does things like this already, so you probably don't have to do it from scratch.

A suggestion:
Assuming that the problem is due to locking, you could dump your mutex & semaphore states from a watchdog thread. With a little bit of work (tracing your call graph), you can determine how you've arrived at a deadlock, which call paths are mutually blocking, etc.
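A rough sketch of what such lock tracing could look like, assuming you can route your locks through a wrapper. LockRegistry and TracedMutex are hypothetical names, not part of any library; the watchdog thread would call dump() when it suspects a hang.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <utility>

// Hypothetical registry: records which thread is waiting on or holding
// which named lock, so a watchdog can print the whole picture.
class LockRegistry {
public:
    void set(const std::string& lock, std::thread::id tid, const char* state) {
        std::lock_guard<std::mutex> g(m_);
        states_[key(lock, tid)] = state;
    }
    void clear(const std::string& lock, std::thread::id tid) {
        std::lock_guard<std::mutex> g(m_);
        states_.erase(key(lock, tid));
    }
    // One line per (lock, thread) pair, e.g. "A thread=1234 HOLDING".
    std::string dump() const {
        std::lock_guard<std::mutex> g(m_);
        std::string out;
        for (const auto& kv : states_) out += kv.first + " " + kv.second + "\n";
        return out;
    }
private:
    static std::string key(const std::string& lock, std::thread::id tid) {
        std::ostringstream os;
        os << lock << " thread=" << tid;
        return os.str();
    }
    mutable std::mutex m_;
    std::map<std::string, std::string> states_;
};

// Drop-in for std::mutex (works with std::lock_guard) that reports its state.
class TracedMutex {
public:
    TracedMutex(std::string name, LockRegistry& reg)
        : name_(std::move(name)), reg_(reg) {
        assert(!name_.empty());
    }
    void lock() {
        reg_.set(name_, std::this_thread::get_id(), "WAITING");
        m_.lock();
        reg_.set(name_, std::this_thread::get_id(), "HOLDING");
    }
    void unlock() {
        reg_.clear(name_, std::this_thread::get_id());
        m_.unlock();
    }
private:
    std::string name_;
    LockRegistry& reg_;
    std::mutex m_;
};
```

In a deadlock, the dump shows one thread HOLDING lock A and WAITING on B while another thread shows the reverse. The registry adds a second lock acquisition to every lock operation, so it is probably something to enable only in diagnostic builds.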

While a crashdump analysis seems to provide a solution for identifying the problem, in my experience this rarely bears much fruit since it lacks sufficient unambiguous detail of what happened just before the crash. Even with the tool you propose, it would provide little more than circumstantial evidence of what happened. I bet the cause is unprotected shared data, so a lock trace wouldn't show it.
The most productive way of finding this—in my experience—is distilling the application's logic to its essence and identifying where conflicts must be occurring. How many threads are there? How many are GUI? At how many points do the threads interact? Yep, this is good old desk checking. Leading suspect interactions can be identified in a day or two, then just convince a small group of skeptics that the interaction is correct.

Related

How to release resources if a program crashes

I have a program that uses services from others. If the program crashes, what is the best way to close those services? At server side, I would define some checkers that monitor if a client is invalid periodically. But can we do any thing at client? I am not the sure if the normal RAII still works at this case. My code is written in C and C++.
If your application experiences a hard crash, then no, your carefully crafted cleanup code will not run, whether it is part of an RAII paradigm or a method you call at the end of main. None of an application's cleanup code runs after a crash that causes the application to be terminated.
Of course, this is not true for exceptions. Although those might eventually cause the application to be terminated, they still trigger this termination in a controlled way. If the exception is caught somewhere up the stack (a catch block in main is enough), your RAII-based cleanup code will be executed during unwinding, unless it also throws an exception. Then you're back to being unceremoniously ripped out of memory. If the exception is never caught, std::terminate is called, and whether the stack is unwound first is implementation-defined, so don't rely on destructors running in that case.
But even if your application's cleanup code can't run, the operating system will still attempt to clean up after you. This solves the problem of unreleased memory, handles, and other system objects. In general, if you crash, you need not worry about releasing these things. Your application's state is inconsistent, so trying to execute a bunch of cleanup code will just lead to unpredictable and potentially erroneous behavior, not to mention wasting a bunch of time. Just crash and let the system deal with your mess. As Raymond Chen puts it:
The building is being demolished. Don't bother sweeping the floor and emptying the trash cans and erasing the whiteboards. And don't line up at the exit to the building so everybody can move their in/out magnet to out. All you're doing is making the demolition team wait for you to finish these pointless housecleaning tasks.
Do what you must; skip everything else.
The only problem with this approach is, as you suggest in this question, when you're managing resources that are not controlled by the operating system, such as a remote resource on another system. In that case, there is very little you can do. The best scenario is to make your application as robust as possible so that it doesn't crash, but even that is not a perfect solution. Consider what happens when the power is lost, e.g. because a user's cat pulled the cord from the wall. No cleanup code could possibly run then, so even if your application never crashes, there may be termination events that are outside of your control. Therefore, your external resources must be robust in the event of failure. Time-outs are a standard method, and a much better solution than polling.
Another possible solution, depending on the particular use case, is to run consistency-checking and cleanup code at application initialization. This might be something that you would do for a service that is intended to run continuously and will be restarted promptly after termination. The next time it restarts, it checks its data and/or external resources for consistency, releases and/or re-initializes them as necessary, and then continues on as normal. Obviously this is a bad solution for a typical application, because there is no guarantee that the user will relaunch it in a timely manner.
As the other answers make clear, hoping to clean up after an uncontrolled crash (i.e., a failure which doesn't trigger the C++ exception unwind mechanism) is probably a path to nowhere. Even if you cover some cases, there will be other cases that fail and you are building in a serious vulnerability to those cases.
You mention that the source of the crashes is that you are "us[ing] services from others". I take this to mean that you are running untrusted code in-process, which is the potential source of crashes. In this case, you might consider running the untrusted code "out of process" and communicating back to your main process through a pipe or shared memory or whatever. Then you isolate the crashes to this child process, and can do controlled cleanup in your main process. A separate process is really the lightest-weight thing you can do that gives you the strong isolation you need to avoid corruption in the calling code.
If forking a process per-call is performance-prohibitive, you can try to keep the child process alive for multiple calls.
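As a minimal sketch of that isolation, assuming a POSIX system (the question does not name a platform): the untrusted call runs in a forked child and the parent inspects how the child ended. run_isolated and ChildResult are illustrative names only.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class ChildResult { Ok, Failed, Crashed, SpawnError };

// Runs `task` (a callable returning an int exit code) in a forked child.
// `task` is a stand-in for the untrusted call.
template <typename Fn>
ChildResult run_isolated(Fn task) {
    const pid_t pid = fork();
    if (pid < 0) return ChildResult::SpawnError;
    if (pid == 0) {
        int rc = EXIT_FAILURE;
        try { rc = task(); } catch (...) { /* report as a plain failure */ }
        _exit(rc);  // skip the parent's atexit handlers and stdio buffers
    }
    assert(pid > 0);  // only the parent reaches this point
    int status = 0;
    pid_t r;
    do { r = waitpid(pid, &status, 0); } while (r < 0 && errno == EINTR);
    if (r < 0) return ChildResult::SpawnError;
    if (WIFSIGNALED(status)) return ChildResult::Crashed;  // e.g. SIGSEGV
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return ChildResult::Ok;
    return ChildResult::Failed;
}
```

When the result is Crashed, the parent is still in a consistent state and can release the remote resources itself. A real version would also need a channel (pipe, shared memory) to get results back from the child, which is left out here.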
One approach would be for your program to have two modes: normal operation and monitoring.
When started in the usual way, it would:
Act as a background monitor.
Launch a subprocess of itself, passing it an internal argument (something that wouldn't clash with normal arguments passed to it, if any).
When the subprocess exits, it would release any resources held at the server.
When started with the internal argument, it would:
Expose the user interface and "act normally", using the resources of the server.
You might look into atexit, which may give you the functionality you need to release resources upon program termination. I don't believe it is infallible, though.
Having said that, however, you should really be focusing on making sure your program doesn't crash; if you're hitting an error that is "unrecoverable", you should still invest in some error-handling code. If the error is caused by a Seg-Fault or some other similar OS-related error, you can either enable SEH exceptions (not sure if this is Windows-specific or not) to enable you to catch them with a normal try-catch block, or write some Signal Handlers to intercept those errors and deal with them.

How do I correctly handle a permanently hung third-party library call in a thread in C++?

I have a device which has a library. Some of its functions are most awesomely ill-behaved, in the "occasionally hang forever" sense.
I have a program which uses this device. If/when it hangs, I need to be able to recover gracefully and reset it. The offending calls should return within milliseconds and are being called in a loop many many times per second.
My first question is: when a thread running the recalcitrant function hangs, what do I do? Even if I litter the thread with interruption points, this happens:
boost::this_thread::interruption_point(); // irrelevant, in the past
deviceLibrary.thatFunction(); // <-- hangs here forever
boost::this_thread::interruption_point(); // never gets here!
The only word I've read on what to do there is to modify the function itself, but that's out of the question for a variety of reasons -- not least of which is "this is already miles outside of my skill set".
I have tried asynchronous launching with C++11 futures:
// this was in a looping thread -- it does not work: wait_for sometimes never returns
std::future<void> future = std::async(std::launch::async,
    [this] () { deviceLibrary.thatFunction(*data_ptr); });
if (future.wait_for(std::chrono::seconds(timeout)) == std::future_status::timeout) {
    printf("no one will ever read this\n");
    deviceLibrary.reset(); // this would work if it ever got here
}
No dice, in that or a number of variations.
I am now trying boost::asio with a thread_group of a number of worker threads running io_service::run(). It works magnificently until the second time it times out. Then I've run out of threads, because each hanging thread eats up one of my thread_group and it never comes back ever.
My latest idea is to call work_threads.create_thread to make a new thread to replace the now-hanging one. So my second question is: if this is a viable way of dealing with this, how should I cope with the slowly amassing group of hung threads? How do I remove them? Is it fine to leave them there?
Incidentally, I should mention that there is in fact a version of deviceLibrary.thatFunction() that has a timeout. It doesn't.
I found this answer but it's C# and Windows specific, and this one which seems relevant. But I'm not so sure about spawning hundreds of extra processes a second (edit: oh right; I could banish all the calls to one or two separate processes. If they communicate well enough and I can share the device between them. Hm...)
Pertinent background information: I'm using MSVC 2013 on Windows 7, but the code has to cross-compile for ARM on Debian with GCC 4.6 also. My level of C++ knowledge is... well... if it seems like I'm missing something obvious, I probably am.
Thanks!
If you want to reliably kill something that's out of your control and may hang, use a separate process.
While process isolation was once considered to be very 'heavy-handed', browsers like Chrome today will implement it on a per-tab basis. Each tab gets a process, the GUI has a process, and if the tab rendering dies it doesn't take down the whole browser.
How can Google Chrome isolate tabs into separate processes while looking like a single application?
Threads are simply not designed for letting a codebase defend itself from ill-behaved libraries. Processes are.
So define the services you need, put that all in one program using your flaky libraries, and use interprocess communication from your main app to speak with the bridge. If the bridge times out or has a problem due to the flakiness, kill it and restart it.
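The kill-and-restart part might look roughly like this on the Debian/ARM side (POSIX only; on Windows the equivalent would be built on CreateProcess and TerminateProcess). This is a sketch under assumptions: call_with_timeout is a made-up name, and a real bridge would be a long-lived separate program talking over IPC rather than a fork per call.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Runs `call` in a child process. Returns true if the child finished cleanly
// within `timeout`; otherwise kills it (if still running) and returns false.
template <typename Fn>
bool call_with_timeout(Fn call, std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    const pid_t pid = fork();
    if (pid < 0) return false;  // could not start the bridge
    if (pid == 0) {             // child: stands in for the bridge process
        call();
        _exit(0);
    }
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    for (;;) {
        const pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) return WIFEXITED(status) && WEXITSTATUS(status) == 0;
        if (r < 0 && errno != EINTR) return false;
        if (std::chrono::steady_clock::now() >= deadline) break;
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }
    kill(pid, SIGKILL);  // a hung thread cannot be killed safely; a process can
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}  // reap it
    return false;
}
```

After a false return the caller would reset the device and start a fresh bridge. Unlike a hung thread, the killed process leaves nothing behind in the main application.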
I am only going to answer this part of your text:
when a thread running the recalcitrant function hangs, what do I do?
A thread could invoke inline machine instructions.
These instructions might clear the interrupt flag.
This may cause the code to be non interruptible.
As long as it does not decide to return, you cannot force it to return.
You might be able to force it to die (eg kill the process containing the thread), but you cannot force the code to return.
I hope my answer convinces you that the answer recommending to use a bridge process is in fact what you should do.
The first thing you do is make sure that it's the library that's buggy. Then you create a minimal example that demonstrates the problem (if possible), and send a bug report and the example to the library's developer. Lastly, you cross your fingers and wait.
What you don't do is put your fingers in your ears and say "LALALALALA" while you hide the problem behind layers of crud in an attempt to pretend the problem is gone.

Segfault logic with two threads

I have an application with main thread and additional (detached) process created in it.
In that process we are running network server which sends logs from queue through the network.
The question is: is it possible to do something in segfault handler to wait/finish for sending that log queue. So I want almost 100% delivery of that queue.
While it is possible to write a segfault handler, I highly recommend against it. First off, it's very easy to get your program into a "won't terminate" state due to a segfault in the segfault handler.
Second, as dan3 mentions, the memory of the process is likely in a corrupt state, making it hard to know what will and won't work.
Finally, you lose the opportunity to use the coredump from the process to help track down the problem.
While it's not recommended, it is possible.
My recommendation is to write a small program that avoids memory allocation and the use of pointers as much as possible. Perhaps create buffers as global arrays and only ever access them with limited code that can be reviewed by several skilled developers and tested thoroughly (stress testing is great here). Keep in mind, though, that the message could still get lost by the sender or receiver if they crash, so it may not be worth the effort.
By the way - when Netscape first wrote a version of their browser for Linux, I ran it and it kept getting into a locked-up state. Using the strace program, I quickly found that it was in an infinite segfault loop. Very frustrating, and leading to almost 100% cpu wasted.
You can wait() (or waitpid()) for a process and pthread_join() for a thread to finish (you didn't specify clearly which one you use).
Remember that if you are in segfault handler, your memory is messed up (avoid malloc() and free()) and your FILE * could also be borked.
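If you do go down this road, a handler restricted to async-signal-safe calls might look like the sketch below (POSIX). It only writes a fixed note to a file descriptor and exits; it does not try to drain the queue, which is exactly the part that cannot be done safely. install_segv_handler and the note text are invented for the example.

```cpp
#include <cassert>
#include <cstring>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical: descriptor (pipe or socket) on which to announce the crash.
static volatile sig_atomic_t g_crash_fd = STDERR_FILENO;
static const char kCrashNote[] = "crashed: log queue may be incomplete\n";

static void on_segv(int) {
    // Only async-signal-safe calls here: write() and _exit().
    // No malloc/free, no stdio, no locks.
    const ssize_t ignored = write(g_crash_fd, kCrashNote, sizeof(kCrashNote) - 1);
    (void)ignored;
    _exit(70);
}

bool install_segv_handler(int notify_fd) {
    assert(notify_fd >= 0);
    g_crash_fd = notify_fd;
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    // SA_RESETHAND restores the default action on entry, so a second fault
    // inside the handler kills the process instead of looping forever.
    sa.sa_flags = SA_RESETHAND;
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

Note that exiting from the handler gives up the core dump mentioned above. If the other end of notify_fd is a separate process, that process can take over whatever part of the log it already received.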

console out in multi-threaded applications

When developing applications I usually print to the console to get useful debugging/tracing information. The application I am working on now is multi-threaded, and I sometimes see the output of my printf calls overlapping.
I tried to synchronize access to the screen with a mutex, but that ends up slowing down and blocking the app. How can I solve this issue?
I am aware of MT logging libraries, but since I log so much, using them slows my app down (a bit).
I was thinking of the following idea: instead of logging within my application, why not log outside it? I would like to send logging information via a socket to a second application process that actually prints it to the screen.
Are you aware of any library already doing this?
I use Linux/gcc.
thanks
afg
You have 3 options. In increasing order of complexity:
Just use a simple mutex within each thread. The mutex is shared by all threads.
Send all the output to a single thread that does nothing but the logging.
Send all the output to a separate logging application.
Under most circumstances, I would go with #2. #1 is fine as a starting point, but in all but the most trivial applications you can run into problems serializing the application. #2 is still very simple, and simple is a good thing, but it is also quite scalable. You still end up doing the processing in the main application, but for the vast majority of applications you gain nothing by spinning this off to its own dedicated application.
Number 3 is what you're going to do in performance-critical server type applications, but the minimal performance gain you get with this approach is 1: very difficult to achieve, 2: very easy to screw up, and 3: not the only or even most compelling reason people generally take this approach. Rather, people typically take this approach when they need the logging service to be separated from the applications using it.
Which OS are you using?
Not sure about specific libraries, but one of the classical approaches to this sort of problem is to use a logging queue, which is worked by a writer thread whose job is purely to write the log file.
You need to be aware, either with a threaded approach, or a multi-process approach that the write queue may back up, meaning it needs to be managed, either by discarding entries or by slowing down your application (which is obviously easier if it's the threaded approach).
It's also common to have some way of categorising your logging output, so that you can have one section of your code logging at a high level, whilst another section of your code logs at a much lower level. This makes it much easier to manage the amount of output that's being written to files and offers you the option of releasing the code with the logging in it, but turned off so that it can be used for fault diagnosis when installed.
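A bare-bones sketch of such a queue with a single writer thread, using only the standard library. AsyncLogger is a made-up name, the sink is whatever callable actually writes the line, and the "discard when full" policy is just one of the two options mentioned above.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

class AsyncLogger {
public:
    AsyncLogger(std::function<void(const std::string&)> sink,
                std::size_t max_pending)
        : sink_(std::move(sink)), max_pending_(max_pending),
          writer_([this] { run(); }) {
        assert(max_pending_ > 0);
    }
    ~AsyncLogger() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_one();
        writer_.join();  // drains whatever is still queued
    }
    // Returns false if the entry was discarded because the queue was full.
    bool log(std::string line) {
        {
            std::lock_guard<std::mutex> g(m_);
            if (queue_.size() >= max_pending_) { ++dropped_; return false; }
            queue_.push_back(std::move(line));
        }
        cv_.notify_one();
        return true;
    }
    std::size_t dropped() const {
        std::lock_guard<std::mutex> g(m_);
        return dropped_;
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;  // done_ is set and nothing is left
            std::deque<std::string> batch;
            batch.swap(queue_);          // hold the lock only for the swap
            lk.unlock();
            for (const auto& s : batch) sink_(s);  // slow I/O outside the lock
            lk.lock();
        }
    }
    std::function<void(const std::string&)> sink_;
    std::size_t max_pending_;
    mutable std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    std::size_t dropped_ = 0;
    bool done_ = false;
    std::thread writer_;  // declared last so it starts after the other members
};
```

Producers only pay for a short lock and a string move, and whole lines reach the sink one at a time, so output from different threads cannot interleave. Anything still in the queue is lost if the process crashes.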
As far as I know, a critical section is lighter-weight than a mutex.
Critical section
Using critical section
If you use gcc, you could use atomic accesses.
Frankly, a mutex is the only way you really want to do that, so it's always going to be slow in your case because you're using so many print statements.... so to solve your question then, don't use so many printf statements; that's your problem to begin with.
Okay, is your solution using a mutex to print? Perhaps you should have a mutex to a message queue which another thread is processing to print; that has a potential hang up, but I think will be faster. So, use an active logging thread that spins waiting for incoming messages to print. The networking solution could work too, but that requires more work; try this first.
What you can do is to have one queue per thread, and have the logging thread routinely go through each of these and post the message somewhere.
This is fairly easy to set up and the amount of contention can be very low (just a pointer swap or two, which can be done w/o locking anything).

On MacOSX, in a C++ program, what guarantees can I have on file IO

I am on MacOSX.
I am writing a multi threaded program.
One thread does logging.
The non-logging threads may crash at any time.
What conventions should I adopt in the logger / what guarantees can I have?
I would prefer a solution where even if I crash during part of a write, previous writes still go to disk, and when reading back the log, I can figure out "ah, I wrote 100 complete entries, then I crashed on the 101st".
Thanks!
I program on Linux, not MacOSX, but probably it's the same there.
If only one thread in your program logs, it means that you buffer the logging data in that logging thread, which then writes it to a file, probably in larger chunks, to avoid too many I/O operations and make logging faster.
The bad thing is that if one thread segfaults, the whole process is destroyed along with the buffered data.
The solutions (for Linux) I know of are:
Transfer the logging data through a socket, without using buffering logging thread (syslog for example). In this case the OS will probably take care of the data, written to the socket, and even if your application crashes, the data should be received on the other end and logged successfully.
Don't use a logging thread; every thread can log synchronously to a file. In this case the losses of the log data after the crash should be very small or none. It's slower though.
I don't know better solutions for this problem yet, it would be interesting to learn ones though.
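For the "100 complete entries, then a crash on the 101st" requirement, one common trick is to frame each entry so that a torn final record is detectable on read-back. A sketch, with an invented record layout (4-byte little-endian length, payload, one end-marker byte); frame_record and count_complete_records are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const char kEndMarker = '\n';  // assumed trailer byte; any fixed value works

// Builds the bytes for one entry. Write them with a single write() call.
std::string frame_record(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    out.reserve(4 + payload.size() + 1);
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<char>((n >> shift) & 0xFF));
    out += payload;
    out.push_back(kEndMarker);
    return out;
}

// Counts the records that are fully present; stops at the first record
// that is truncated or whose end marker is missing.
std::size_t count_complete_records(const std::string& bytes) {
    std::size_t pos = 0, count = 0;
    while (bytes.size() - pos >= 4) {
        std::uint32_t n = 0;
        for (int i = 0; i < 4; ++i)
            n |= static_cast<std::uint32_t>(
                     static_cast<unsigned char>(bytes[pos + i])) << (8 * i);
        const std::size_t need = 4 + static_cast<std::size_t>(n) + 1;
        if (bytes.size() - pos < need) break;            // torn final record
        if (bytes[pos + need - 1] != kEndMarker) break;  // corrupt record
        pos += need;
        ++count;
    }
    return count;
}
```

Data handed to the kernel with write() survives a crash of the process itself. Surviving power loss additionally needs fsync, and on Mac OS X fcntl with F_FULLFSYNC if the data must reach the physical disk.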
As Dmitry says, there are only a few options to ensure you actually capture the logging output. Do you really need to write your own? And does it really need to be on another thread? This may introduce a timing window for a crash to miss logs, when you normally want to log synchronously.
The syslog facility on Unix is the standard means for reliable logging for system services. It essentially solves these sorts of problems you describe (ie. logs are processed out-of-process, so if you crash, your logs still get saved).
If your application is aimed only at Mac OS X, you should have a look at the Apple System Log facility (ASL). It provides a more sophisticated API than syslog and a superset of its functionality.