Structured exception handling with a multi-threaded server

Structured exception handling with a multi-threaded server - c++

This article gives a good overview on why structured exception handling is bad. Is there a way to get the robustness of stopping your server from crashing, while getting past the problems mentioned in the article?
I have a server software that runs about 400 connected users concurrently. But if there is a crash all 400 users are affected. We added structured exception handling and enjoyed the results for a while, but eventually had to remove it because of some crashes causing the whole server to hang (which is worse than just having it crash and restart itself).
So we have this:
With SEH: only 1 user of the 400 get a problem for most crashes
Without SEH: If any user gets a crash, all 400 are affected.
But sometimes with SEH: Server hangs, all 400 are affected and future users that try to connect.

Using SEH because your program crashes randomly is a bad idea. It's not magic pixie dust that you can sprinkle on your program to make it stop crashing. Tracking down and fixing the bugs that cause the crashes is the right solution.
Using SEH when you really need to handle a structured exception is fine. Larry Osterman made a followup post explaining what situations require SEH: memory mapped files, RPC, and security boundary transitions.

Break your program up into worker processes and a single server process. The server process will handle initial requests and then hand them off the the worker processes. If a worker process crashes, only the users on that worker are affected. Don't use SEH for general exception handling - as you have found out, it can and will leave you wide open to deadlocks, and you can still crash anyway.

Fix the bugs in your program ? ;)
Personally I'd keep the SEH handlers in, have them dump out a call stack of where the access violation or whatever happened and fix the problems. The 'sometimes the server hangs' problem is probably due to deadlocks caused by the thread that had the SEH exception keeping something locked and so is unlikely to be related to the fact that you're using SEH itself.

Related

How can I block in my Qt application signal SIGSEGV from cURL library?

My Qt app uses cURL library to send HTTP requests and sometimes cURL sends SIGSEGV and after that my app crashes.
Is it possible to catch this signal and prevent segmentation fault ?

TL;DR:
Don't attempt to block or ignore SIGSEGV. It will just bring you pain.
Explanation:
SIGSEGV means very bad things have happened, and your program has wandered into memory that it is not allowed to access. This means your program is utterly hosed.
Pardon my Canadian.
Your program is broken. You can't trust it to behave in any rational fashion anymore and should actually thank that SIGSEGV for letting you know this. As much as you don't want to see SIGSEGV, it's a hell of a lot better than the alternative of a broken program continuing to run and spewing out false information.
You can catch a SIGSEGV with a signal handler, but say you do catch and try to handle it. Odds are the program goes right back into the code that triggered the SIGSEGV and very likely raises it again as soon as the signal handler. Or does something else weird. Possibly something worse.
It's not really even worth catching SIGSEGV with a signal handler to make an attempt to output a message, because what's the message going to say? "Oh Smurf. Something very bad happened."?
So, yes, you can catch it, but you can't do anything to recover. The program is already too damaged to continue. That leaves preventing SIGSEGV in the first place, and that means fixing your code.
The best thing you can do is determine why the program crashed. The easiest way to do that is run your program with whatever debugger came with your development environment, wait for the crash, then inspect the backtrace for hints on how the program came to meet its doom.
Typically a link to Eric Lippert's How to debug small programs can be found right about here, and I can't think of a good reason to leave it out.
One other thing, though. cURL is a pretty solid library. Odds are overwhelmingly good that you are using it wrong or passed it bad information: a dead pointer, an unterminated C-style string, an pointer to an explosive function. I'd start by looking at how you are using cURL.

No, it's not. Instead, fix the bug and prevent the segmentation fault that way. Presumably your platform supplies some kind of debugger.

How to release resources if a program crashes

I have a program that uses services from others. If the program crashes, what is the best way to close those services? At server side, I would define some checkers that monitor if a client is invalid periodically. But can we do any thing at client? I am not the sure if the normal RAII still works at this case. My code is written in C and C++.

If your application experiences a hard crash, then no, your carefully crafted cleanup code will not run, whether it is part of an RAII paradigm or a method you call at the end of main. None of an application's cleanup code runs after a crash that causes the application to be terminated.
Of course, this is not true for exceptions. Although those might eventually cause the application to be terminated, they still trigger this termination in a controlled way. Generally, the runtime library will catch an unhandled exception and trigger termination. Along the way, your RAII-based cleanup code will be executed, unless it also throws an exception. Then you're back to being unceremoniously ripped out of memory.
But even if your application's cleanup code can't run, the operating system will still attempt to clean up after you. This solves the problem of unreleased memory, handles, and other system objects. In general, if you crash, you need not worry about releasing these things. Your application's state is inconsistent, so trying to execute a bunch of cleanup code will just lead to unpredictable and potentially erroneous behavior, not to mention wasting a bunch of time. Just crash and let the system deal with your mess. As Raymond Chen puts it:
The building is being demolished. Don't bother sweeping the floor and emptying the trash cans and erasing the whiteboards. And don't line up at the exit to the building so everybody can move their in/out magnet to out. All you're doing is making the demolition team wait for you to finish these pointless housecleaning tasks.
Do what you must; skip everything else.
The only problem with this approach is, as you suggest in this question, when you're managing resources that are not controlled by the operating system, such as a remote resource on another system. In that case, there is very little you can do. The best scenario is to make your application as robust as possible so that it doesn't crash, but even that is not a perfect solution. Consider what happens when the power is lost, e.g. because a user's cat pulled the cord from the wall. No cleanup code could possibly run then, so even if your application never crashes, there may be termination events that are outside of your control. Therefore, your external resources must be robust in the event of failure. Time-outs are a standard method, and a much better solution than polling.
Another possible solution, depending on the particular use case, is to run consistency-checking and cleanup code at application initialization. This might be something that you would do for a service that is intended to run continuously and will be restarted promptly after termination. The next time it restarts, it checks its data and/or external resources for consistency, releases and/or re-initializes them as necessary, and then continues on as normal. Obviously this is a bad solution for a typical application, because there is no guarantee that the user will relaunch it in a timely manner.

As the other answers make clear, hoping to clean up after an uncontrolled crash (i.e., a failure which doesn't trigger the C++ exception unwind mechanism) is probably a path to nowhere. Even if you cover some cases, there will be other cases that fail and you are building in a serious vulnerability to those cases.
You mention that the source of the crashes is that you are "us[ing] services from others". I take this to mean that you are running untrusted code in-process, which is the potential source of crashes. In this case, you might consider running the untrusted code "out of process" and communicating back to your main process through a pipe or shared memory or whatever. Then you isolate the crashes this child process, and can do controlled cleanup in your main process. A separate process is really the lightest weight thing you can do that gives you the strong isolation you need to avoid corruption in the calling code.
If forking a process per-call is performance-prohibitive, you can try to keep the child process alive for multiple calls.

One approach would be for your program to have two modes: normal operation and monitoring.
When started in a usual way, it would :
Act as a background monitor.
Launch a subprocess of itself, passing it an internal argument (something that wouldn't clash with normal arguments passed to it, if any).
When the subprocess exists, it would release any resources held at the server.
When started with the internal argument, it would:
Expose the user interface and "act normally", using the resources of the server.

You might look into atexit, which may give you the functionality you need to release resources upon program termination. I don't believe it is infallible, though.
Having said that, however, you should really be focusing on making sure your program doesn't crash; if you're hitting an error that is "unrecoverable", you should still invest in some error-handling code. If the error is caused by a Seg-Fault or some other similar OS-related error, you can either enable SEH exceptions (not sure if this is Windows-specific or not) to enable you to catch them with a normal try-catch block, or write some Signal Handlers to intercept those errors and deal with them.

Application crash with no explanation

I'd like to apologize in advance, because this is not a very good question.
I have a server application that runs as a service on a dedicated Windows server. Very randomly, this application crashes and leaves no hint as to what caused the crash.
When it crashes, the event logs have an entry stating that the application failed, but gives no clue as to why. It also gives some information on the faulting module, but it doesn't seem very reliable, as the faulting module is usually different on each crash. For example, the latest said it was ntdll, the one before that said it was libmysql, the one before that said it was netsomething, and so on.
Every single thread in the application is wrapped in a try/catch (...) (anything thrown from an exception handler/not specifically caught), __try/__except (structured exceptions), and try/catch (specific C++ exceptions). The application is compiled with /EHa, so the catch all will also catch structured exceptions.
All of these exception handlers do the same thing. First, a crash dump is created. Second, an entry is logged to a new file on disk. Third, an entry is logged in the application logs. In the case of these crashes, none of this is happening. The bottom most exception handler (the try/catch (...)) does nothing, it just terminates the thread. The main application thread is asleep and has no chance of throwing an exception.
The application log files just stop logging. Shortly after, the process that monitors the server notices that it's no longer responding, sends an alert, and starts it again. If the server monitor notices that the server is still running, but just not responding, it takes a dump of the process and reports this, but this isn't happening.
The only other reason for this behavior that I can come up with, aside from uncaught exceptions, is a call to exit or similar. Searching the code brings up no calls to any functions that could terminate the process. I've also made sure that the program isn't terminating normally (i.e. a stop request from the service manager).
We have tried running it with windbg attached (no chance to use Visual Studio, the overhead is too high), but it didn't report anything when the crash occurred.
What can cause an application to crash like this? We're beginning to run out of options and consider that it might be a hardware failure, but that seems a bit unlikely to me.

If your app is evaporating an not generating a dump file, then it is likely that an exception is being generated which your app doesnt (or cant) handle. This could happen in two instances:
1) A top-level exception is generated and there is no matching catch block for that exception type.
2) You have a matching catch block (such as catch(...)), but you are generating an exception within that handler. When this happens, Windows will rip the bones from your program. Your app will simply cease to exist. No dump will be generated, and virtually nothing will be logged, This is Windows' last-ditch effort to keep a rogue program from taking down the entire system.
A note about catch(...). This is patently Evil. There should (almost) never be a catch(...) in production code. People who write catch(...) generally argue one of two things:
"My program should never crash. If anything happens, I want to recover from the exception and continue running. This is a server application! ZOMG!"
-or-
"My program might crash, but if it does I want to create a dump file on the way down."
The former is a naive and dangerous attitude because if you do try to handle and recover from every single exception, you are going to do something bad to your operating footprint. Maybe you'll munch the heap, keep resources open that should be closed, create deadlocks or race conditions, who knows. Your program will suffer from a fatal crash eventually. But by that time the call stack will bear no resemblance to what caused the actual problem, and no dump file will ever help you.
The latter is a noble & robust approach, but the implementation of it is much more difficult that it might seem, and it fraught with peril. The problem is you have to avoid generating any further exceptions in your exception handler, and your machine is already in a very wobbly state. Operations which are normally perfectly safe are suddenly hand grenades. new, delete, any CRT functions, string formatting, even stack-based allocations as simple as char buf[256] could make your application go >POOF< and be gone. You have to assume the stack and the heap both lie in ruins. No allocation is safe.
Moreover, there are exceptions that can occur that a catch block simply can't catch, such as SEH exceptions. For that reason, I always write an unhandled-exception handler, and register it with Windows, via SetUnhandledExceptionFilter. Within my exception handler, I allocate every single byte I need via static allocation, before the program even starts up. The best (most robust) thing to do within this handler is to trigger a seperate application to start up, which will generate a MiniDump file from outside of your application. However, you can generate the MiniDump from within the handler itself if you are extremely careful no not call any CRT function directly or indirectly. Basically, if it isn't an API function you're calling, it probably isn't safe.

I've seen crashes like these happen as a result of memory corruption. Have you run your app under a memory debugger like Purify to see if that sheds some light on potential problem areas?

Analyze memory in a signal handler
http://msdn.microsoft.com/en-us/library/xdkz3x12%28v=VS.100%29.aspx

This isn't a very good answer, but hopefully it might help you.
I ran into those symptoms once, and after spending some painful hours chasing the cause, I found out a funny thing about Windows (from MSDN):
Dereferencing potentially invalid
pointers can disable stack expansion
in other threads. A thread exhausting
its stack, when stack expansion has
been disabled, results in the
immediate termination of the parent
process, with no pop-up error window
or diagnostic information.
As it turns out, due to some mis-designed data sharing between threads, one of my threads would end up dereferencing more or less random pointers - and of course it hit the area just around the stack top sometimes. Tracking down those pointers was heaps of fun.
There's some technincal background in Raymond Chen's IsBadXxxPtr should really be called CrashProgramRandomly

Late response, but maybe it helps someone: every Windows app has a limit on how many handles can have open at any time. We had a service not releasing a handle in some situation, the service would just disappear, after a few days, or at times weeks (depending on the usage of the service).
Finding the leak was great fun :D (use Task Manager to see thread count, handles count, GDI objects, etc)

Catch unhandled exception of invisible thread

In my C++ application i use an activeX component that runs its own thread (or several I don't know). Sometimes this components throws exceptions. I would like to catch these exceptions and do recovery instead of my entire application crashing. But since I don't have access to its source code or thread I am unsure how it would be done.
The only solution I can think of is to run it in its own process. Using something like CreateProcess and then CreateRemoteThread, unsure how it could be implemented.
Any suggestion on how to go about solving this?

If the ActiveX component is launching its own threads, then there isn't a lot that you can do. You could set a global exception handler and try to swallow exceptions, but this creates a high likelihood that your program state will become corrupted and lead to bizarre "impossible" crashes down the road.
Running the buggy component in a separate process is the most robust solution, as you'll be able to identify and recover from fatal errors without compromising your own program state.

Try setting up an exception filter with SetUnhandledExceptionFilter().

How can I catch an application crash or exit in mshtml?

Our application is using mshtml. That dll is causing our application to exit ungracefully due to well known problems in mshtml since we don't install newer browsers on users' machines. We just use what they have already.
The SetUnhandledExceptionFilter() does not handle this, nor does a try/catch block around the calls into mshtml. The exception filter does catch other exceptions.
The exception settings are /EHa.
When I remote debug the crash I see:
unhandled exception - access violation
In mshtml but if I don't attach to the process with a debugger, the application just exits.
What do we need to do to catch the exception?
Edit:
This is an old version of IE6.

Seems to be that MSHTML functions passes necessary data to a separate thread. That separate thread processes your request and the exception takes place. That's why you cannot catch exception via try/catch block. You should check it in the debugger. If that is true the only way to catch exceptions from other threads is to set hooks for TerminateThread and TerminateProcess functions. Check out CApiHook class by Jeffrey Richter for that purpose(or other implementations). But it will make your program to be incompatible with /NXCOMPAT compiler flag.
Your second option is to install all important OS updates.

Almost there. It's not SetUnhandledExceptionFilter() but AddVectoredExceptionHandler you want. With that said, you can get the first shot at this exception.
Of course I'm wondering what you're going to do afterwards. TerminateThread is probably the only option you have, but that may very well deadlock MSHTML. So that needs killing too.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js