I have a boost threadpool which I use to do certain tasks. I also have a Sensor class that has the pure virtual function doWork(int total) = 0;. Whenever it is requested, my main process gets the necessary Sensor pointer and tells the threadpool to run Sensor::doWork(int total).
threadpool->schedule(boost::bind(&Sensor::doWork,this,123456));
I am dynamically loading libraries of type Sensor, thus it is out of my control if someone else has faulty coding which results in SEGFAULTS and such. So is there a way for me to (in my main process) handle any errors thrown by Sensor::doWork(int total), clean up the thread, delete that sensor object and notify the console what and where the error has occurred?
Really the only way to handle a segmentation fault here is to run Sensor::doWork in a completely separate process.
In UNIX, this involves using fork (or some other similar means), running Sensor::doWork in the child process, and then somehow shuttling the results back to the parent process.
I assume similar means are available in Windows.
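A minimal sketch of the fork approach on a POSIX system, assuming the work can be expressed as a callable that returns a small integer. `run_isolated` and `ChildOutcome` are hypothetical names made up for this illustration, not part of Boost or of the Sensor interface. The parent learns from `waitpid` whether the child exited normally or was killed by a signal such as SIGSEGV.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <csignal>
#include <functional>

// How the child process ended (hypothetical type for this sketch).
struct ChildOutcome {
    bool crashed;        // true if the child was killed by a signal
    int  signal_number;  // the signal when crashed, otherwise 0
    int  exit_code;      // exit status when not crashed, otherwise -1
};

// Runs `task` in a forked child so that a crash cannot take down the caller.
ChildOutcome run_isolated(const std::function<int()>& task) {
    assert(task && "task must be callable");
    ChildOutcome failed = { false, 0, -1 };
    pid_t pid = fork();
    if (pid < 0) return failed;          // fork itself failed
    if (pid == 0) {
        // Child: do the risky work, then leave without running the
        // parent's atexit handlers or flushing its stdio buffers.
        _exit(task() & 0xff);
    }
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return failed;  // retry only when interrupted
    }
    if (WIFSIGNALED(status)) {
        ChildOutcome crashed = { true, WTERMSIG(status), -1 };
        return crashed;
    }
    ChildOutcome exited = { false, 0, WEXITSTATUS(status) };
    return exited;
}
```

Only the exit code travels back here; real results would need a pipe or shared memory. Forking from a process that already runs a thread pool has its own caveats, since only the calling thread exists in the child.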
EDIT: I thought I'd flesh out a bit some of the things you can do.
Solution #1: you can work with processes in the same fashion as you would threads. For example, you could create a process pool whose workers sit in a loop of:
Wait for a task to be passed in over a pipe or queue or some similar object
Perform the task
Return the results over a pipe or queue or some similar object
And since you're executing the tasks in the other processes, you're protected against them crashing. The main difficulty with this solution is actually communicating between processes; maybe boost's interprocess library will help with that. I've mainly done this sort of thing in python, which has a standard multiprocessing module that handles this stuff for you.
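The loop above can be sketched with a single long-lived worker process and two pipes, assuming POSIX. `WorkerProcess` and `perform_task` are hypothetical names for this illustration; `perform_task` stands in for the real work and deliberately aborts on negative input to mimic a crash.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdlib>

// Hypothetical stand-in for the risky work; a negative input "crashes".
int perform_task(int input) {
    if (input < 0) std::abort();
    return input * 2;
}

class WorkerProcess {
public:
    WorkerProcess() : pid_(-1), to_child_(-1), from_child_(-1) {}
    ~WorkerProcess() { stop(); }
    WorkerProcess(const WorkerProcess&) = delete;
    WorkerProcess& operator=(const WorkerProcess&) = delete;

    // (Re)starts the worker. Returns false if the pipes or fork failed.
    bool start() {
        stop();
        // Process-wide: writing to a dead worker should fail, not kill us.
        std::signal(SIGPIPE, SIG_IGN);
        int down[2], up[2];
        if (pipe(down) != 0) return false;
        if (pipe(up) != 0) { close(down[0]); close(down[1]); return false; }
        pid_ = fork();
        if (pid_ < 0) {
            close(down[0]); close(down[1]); close(up[0]); close(up[1]);
            return false;
        }
        if (pid_ == 0) {
            close(down[1]); close(up[0]);
            int task = 0;
            // Wait for a task, perform it, return the result, repeat.
            while (read(down[0], &task, sizeof task) ==
                   static_cast<ssize_t>(sizeof task)) {
                int result = perform_task(task);
                if (write(up[1], &result, sizeof result) !=
                    static_cast<ssize_t>(sizeof result)) break;
            }
            _exit(0);
        }
        close(down[0]); close(up[1]);
        to_child_ = down[1];
        from_child_ = up[0];
        return true;
    }

    // Sends one task. Returns false if the worker died before answering.
    bool call(int task, int& result) {
        if (write(to_child_, &task, sizeof task) !=
            static_cast<ssize_t>(sizeof task)) return false;
        return read(from_child_, &result, sizeof result) ==
               static_cast<ssize_t>(sizeof result);
    }

    // Closes the pipes and reaps the worker (blocks if the worker hangs).
    void stop() {
        if (pid_ <= 0) return;
        close(to_child_);
        close(from_child_);
        waitpid(pid_, nullptr, 0);
        pid_ = -1;
        to_child_ = from_child_ = -1;
    }

private:
    pid_t pid_;
    int to_child_;
    int from_child_;
};
```

When `call` returns false the caller can log the failure and call `start()` again to get a fresh worker. A real pool would hold several of these and would need a richer message format than a single int.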
Solution #2: You could divide your application into "safe" and "risky" portions that run in different processes. The "risky" portion executes the Sensor::doWork methods and anything else you might want to do in that process -- but only work that is acceptable to be spontaneously lost if it crashes. The "safe" portion deals with any precious information that you cannot afford to lose, and monitors the "risky" portion, performing some recovery operations when the child crashes. And, of course, whatever other work you decide you want to do in the safe part.
If you got a SIGSEGV, even if you caught it you have no guarantee about your program state so there's pretty much no way to recover.
If you're working with 3rd party libraries, and they're buggy, and the library maintainer won't fix it (and you don't have the source) then your only recourse is to run the third party library from within a totally separate binary that talks to the main binary by some means. See for example firefox and plugin-container.
You might want to register a callback function to catch SIGSEGV. In C this can be done using signal. Be aware, however, that there is not much you can do when the OS sends you a SIGSEGV (note that it isn't required to). You don't really know what state your program is in. If, for example, the heap got corrupted, new and delete operations may fail, so even a plain simple
std::cout << std::string("hello world") << std::endl;
statement might not work, since memory from the heap needs to be allocated.
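A minimal sketch of such a handler on POSIX, using `sigaction` rather than `signal`. It restricts itself to `write` and `_exit`, which are async-signal-safe and need no heap allocation. The names `on_segv` and `install_segv_handler`, the message text, and the exit code 70 are arbitrary choices for this illustration.

```cpp
#include <csignal>
#include <unistd.h>

// Runs when SIGSEGV is delivered. It must not touch the heap or iostreams:
// only async-signal-safe calls such as write() and _exit() are used.
extern "C" void on_segv(int) {
    static const char msg[] = "fatal: SIGSEGV received, terminating\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    _exit(70);  // arbitrary exit code chosen for this sketch
}

// Installs the handler. Returns false if sigaction() failed.
bool install_segv_handler() {
    struct sigaction sa = {};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = on_segv;
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

The handler only reports and exits. It makes no attempt to resume the faulting code, because the program state can no longer be trusted.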
Best, Christoph
I have a program that uses services from others. If the program crashes, what is the best way to close those services? On the server side, I would define some checkers that periodically monitor whether a client is still valid. But can we do anything on the client side? I am not sure whether normal RAII still works in this case. My code is written in C and C++.
If your application experiences a hard crash, then no, your carefully crafted cleanup code will not run, whether it is part of an RAII paradigm or a method you call at the end of main. None of an application's cleanup code runs after a crash that causes the application to be terminated.
Of course, this is not true for exceptions. Although those might eventually cause the application to be terminated, they still trigger this termination in a controlled way. Generally, the runtime library will catch an unhandled exception and trigger termination. Along the way, your RAII-based cleanup code will be executed, unless it also throws an exception; then you're back to being unceremoniously ripped out of memory. (Strictly speaking, whether the stack is unwound for an exception that nothing catches is implementation-defined, so wrap the body of main in a try/catch block if you depend on that cleanup.)
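A small sketch of the exception case, assuming the exception is caught somewhere up the stack. `ServiceHandle`, `use_service` and `run_and_catch` are hypothetical names for this illustration. The guard's destructor runs during unwinding, before the handler is entered.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical RAII guard: logs when the resource is acquired and released.
class ServiceHandle {
public:
    ServiceHandle(const std::string& name, std::vector<std::string>& log)
        : name_(name), log_(log) { log_.push_back("acquire " + name_); }
    ~ServiceHandle() { log_.push_back("release " + name_); }
private:
    std::string name_;
    std::vector<std::string>& log_;
};

// Acquires the resource, then fails with an exception.
void use_service(std::vector<std::string>& log) {
    ServiceHandle handle("db", log);
    throw std::runtime_error("service call failed");
}

// Returns the sequence of events observed when the failure is caught.
std::vector<std::string> run_and_catch() {
    std::vector<std::string> log;
    try {
        use_service(log);
    } catch (const std::exception& e) {
        log.push_back(std::string("caught ") + e.what());
    }
    return log;
}
```

The resulting log is "acquire db", "release db", "caught service call failed", in that order. Nothing comparable happens on a segmentation fault or `abort()`.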
But even if your application's cleanup code can't run, the operating system will still attempt to clean up after you. This solves the problem of unreleased memory, handles, and other system objects. In general, if you crash, you need not worry about releasing these things. Your application's state is inconsistent, so trying to execute a bunch of cleanup code will just lead to unpredictable and potentially erroneous behavior, not to mention wasting a bunch of time. Just crash and let the system deal with your mess. As Raymond Chen puts it:
The building is being demolished. Don't bother sweeping the floor and emptying the trash cans and erasing the whiteboards. And don't line up at the exit to the building so everybody can move their in/out magnet to out. All you're doing is making the demolition team wait for you to finish these pointless housecleaning tasks.
Do what you must; skip everything else.
The only problem with this approach is, as you suggest in this question, when you're managing resources that are not controlled by the operating system, such as a remote resource on another system. In that case, there is very little you can do. The best scenario is to make your application as robust as possible so that it doesn't crash, but even that is not a perfect solution. Consider what happens when the power is lost, e.g. because a user's cat pulled the cord from the wall. No cleanup code could possibly run then, so even if your application never crashes, there may be termination events that are outside of your control. Therefore, your external resources must be robust in the event of failure. Time-outs are a standard method, and a much better solution than polling.
Another possible solution, depending on the particular use case, is to run consistency-checking and cleanup code at application initialization. This might be something that you would do for a service that is intended to run continuously and will be restarted promptly after termination. The next time it restarts, it checks its data and/or external resources for consistency, releases and/or re-initializes them as necessary, and then continues on as normal. Obviously this is a bad solution for a typical application, because there is no guarantee that the user will relaunch it in a timely manner.
As the other answers make clear, hoping to clean up after an uncontrolled crash (i.e., a failure which doesn't trigger the C++ exception unwind mechanism) is probably a path to nowhere. Even if you cover some cases, there will be other cases that fail and you are building in a serious vulnerability to those cases.
You mention that the source of the crashes is that you are "us[ing] services from others". I take this to mean that you are running untrusted code in-process, which is the potential source of crashes. In this case, you might consider running the untrusted code "out of process" and communicating back to your main process through a pipe or shared memory or whatever. Then you isolate the crashes to this child process, and can do controlled cleanup in your main process. A separate process is really the lightest-weight thing you can do that gives you the strong isolation you need to avoid corruption in the calling code.
If forking a process per-call is performance-prohibitive, you can try to keep the child process alive for multiple calls.
One approach would be for your program to have two modes: normal operation and monitoring.
When started in the usual way, it would:
Act as a background monitor.
Launch a subprocess of itself, passing it an internal argument (something that wouldn't clash with normal arguments passed to it, if any).
When the subprocess exits, it would release any resources held at the server.
When started with the internal argument, it would:
Expose the user interface and "act normally", using the resources of the server.
You might look into atexit, which may give you the functionality you need to release resources upon program termination. I don't believe it is infallible, though.
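A sketch of where `atexit` helps and where it does not, assuming POSIX so that the experiment can run in a forked child. `release_resources` and `handler_ran` are hypothetical names for this illustration. The registered handler runs when the child calls `exit`, but not when it ends through `abort`.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>

static int g_report_fd = -1;  // pipe end the handler reports to

// Stand-in for "release resources held elsewhere": writes one marker byte.
void release_resources() {
    const char mark = 'R';
    ssize_t ignored = write(g_report_fd, &mark, 1);
    (void)ignored;
}

// Runs a child that registers the handler and then ends either normally
// (exit) or abnormally (abort). Returns true if the handler ran.
bool handler_ran(bool crash) {
    int fds[2];
    if (pipe(fds) != 0) return false;
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return false; }
    if (pid == 0) {
        close(fds[0]);
        g_report_fd = fds[1];
        std::atexit(release_resources);
        if (crash) std::abort();  // abnormal termination: handlers skipped
        std::exit(0);             // normal termination: handlers run
    }
    close(fds[1]);
    char c = 0;
    bool ran = read(fds[0], &c, 1) == 1 && c == 'R';
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    return ran;
}
```

Here `handler_ran(false)` reports that the handler ran and `handler_ran(true)` reports that it did not, which is why `atexit` alone cannot cover crashes.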
Having said that, however, you should really be focusing on making sure your program doesn't crash; if you're hitting an error that is "unrecoverable", you should still invest in some error-handling code. If the error is caused by a segfault or some other similar OS-level error, you can either enable SEH exceptions (a Windows-specific mechanism) so that you can catch them with a normal try-catch block, or write some signal handlers to intercept those errors and deal with them.
I have a device which has a library. Some of its functions are most awesomely ill-behaved, in the "occasionally hang forever" sense.
I have a program which uses this device. If/when it hangs, I need to be able to recover gracefully and reset it. The offending calls should return within milliseconds and are being called in a loop many many times per second.
My first question is: when a thread running the recalcitrant function hangs, what do I do? Even if I litter the thread with interruption points, this happens:
boost::this_thread::interruption_point(); // irrelevant, in the past
deviceLibrary.thatFunction(); // <-- hangs here forever
boost::this_thread::interruption_point(); // never gets here!
The only word I've read on what to do there is to modify the function itself, but that's out of the question for a variety of reasons -- not least of which is "this is already miles outside of my skill set".
I have tried asynchronous launching with C++11 futures:
// this was in a looping thread -- it does not work: wait_for sometimes never returns
std::future<void> future = std::async(std::launch::async,
    [this] () { deviceLibrary.thatFunction(*data_ptr); });
if (future.wait_for(std::chrono::seconds(timeout)) == std::future_status::timeout) {
    printf("no one will ever read this\n");
    deviceLibrary.reset(); // this would work if it ever got here
}
No dice, in that or a number of variations.
I am now trying boost::asio with a thread_group of a number of worker threads running io_service::run(). It works magnificently until the second time it times out. Then I've run out of threads, because each hanging thread eats up one of my thread_group and it never comes back ever.
My latest idea is to call work_threads.create_thread to make a new thread to replace the now-hanging one. So my second question is: if this is a viable way of dealing with this, how should I cope with the slowly amassing group of hung threads? How do I remove them? Is it fine to leave them there?
Incidentally, I should mention that there is in fact a version of deviceLibrary.thatFunction() that has a timeout. It doesn't.
I found this answer but it's C# and Windows specific, and this one which seems relevant. But I'm not so sure about spawning hundreds of extra processes a second (edit: oh right; I could banish all the calls to one or two separate processes. If they communicate well enough and I can share the device between them. Hm...)
Pertinent background information: I'm using MSVC 2013 on Windows 7, but the code has to cross-compile for ARM on Debian with GCC 4.6 also. My level of C++ knowledge is... well... if it seems like I'm missing something obvious, I probably am.
Thanks!
If you want to reliably kill something that's out of your control and may hang, use a separate process.
While process isolation was once considered to be very 'heavy-handed', browsers like Chrome today will implement it on a per-tab basis. Each tab gets a process, the GUI has a process, and if the tab rendering dies it doesn't take down the whole browser.
How can Google Chrome isolate tabs into separate processes while looking like a single application?
Threads are simply not designed for letting a codebase defend itself from ill-behaved libraries. Processes are.
So define the services you need, put that all in one program using your flaky libraries, and use interprocess communication from your main app to speak with the bridge. If the bridge times out or has a problem due to the flakiness, kill it and restart it.
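A rough sketch of the "time out, then kill" part on POSIX. For brevity it forks one child per call instead of keeping a long-lived bridge, and it assumes the call can run in a child at all, which may not hold for a device that is already open in the parent. `call_with_timeout` and `CallStatus` are hypothetical names for this illustration.

```cpp
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <functional>

enum class CallStatus { Ok, TimedOut, Failed };

// Runs `call` in a child process and waits at most `timeout_ms` for its
// int result. A child that hangs is killed; one that crashes yields Failed.
CallStatus call_with_timeout(const std::function<int()>& call,
                             int timeout_ms, int& result) {
    assert(call && "call must be callable");
    int fds[2];
    if (pipe(fds) != 0) return CallStatus::Failed;
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return CallStatus::Failed; }
    if (pid == 0) {
        close(fds[0]);
        int value = call();  // may hang or crash; only the child suffers
        ssize_t n = write(fds[1], &value, sizeof value);
        _exit(n == static_cast<ssize_t>(sizeof value) ? 0 : 1);
    }
    close(fds[1]);
    struct pollfd pfd = { fds[0], POLLIN, 0 };
    int ready = poll(&pfd, 1, timeout_ms);
    CallStatus status = CallStatus::Failed;
    if (ready <= 0) {
        // Timeout (0) or poll error (<0): do not wait for the child.
        kill(pid, SIGKILL);
        if (ready == 0) status = CallStatus::TimedOut;
    } else if (read(fds[0], &result, sizeof result) ==
               static_cast<ssize_t>(sizeof result)) {
        status = CallStatus::Ok;
    }
    close(fds[0]);
    waitpid(pid, nullptr, 0);  // reap the child so it is not left a zombie
    return status;
}
```

After a `TimedOut` result the caller would reset the device and carry on. With a persistent bridge process the same poll-then-kill logic applies, followed by a restart of the bridge.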
I am only going to answer this part of your text:
when a thread running the recalcitrant function hangs, what do I do?
A thread could invoke inline machine instructions.
These instructions might clear the interrupt flag.
This may cause the code to be non interruptible.
As long as it does not decide to return, you cannot force it to return.
You might be able to force it to die (eg kill the process containing the thread), but you cannot force the code to return.
I hope my answer convinces you that the answer recommending to use a bridge process is in fact what you should do.
The first thing you do is make sure that it's the library that's buggy. Then you create a minimal example that demonstrates the problem (if possible), and send a bug report and the example to the library's developer. Lastly, you cross your fingers and wait.
What you don't do is put your fingers in your ears and say "LALALALALA" while you hide the problem behind layers of crud in an attempt to pretend the problem is gone.
I know you cannot kill a boost thread, but can you change its task?
Currently I have an array of 8 threads. When a button is pressed, these threads are assigned a task. The task which they are assigned to do is completely independent of the main thread and the other threads. None of the threads have to wait or anything like that, so an interruption point is never reached.
What I need is to be able, at any time, to change the task that each thread is doing. Is this possible? I have tried looping through the array of threads and pointing each thread object at a new one, but of course that doesn't do anything to the old threads.
I know you can interrupt pthreads, but I cannot find a working link to download the library to check it out.
A thread is not some sort of magical object that can be made to do things. It is a separate path of execution through your code. Your code cannot be made to jump arbitrarily around its codebase unless you specifically program it to do so. And even then, it can only be done within the rules of C++ (ie: calling functions).
You cannot kill a boost::thread because killing a thread would utterly wreck some of the most fundamental assumptions a programmer makes. You now have to take into account the possibility that the next line doesn't execute for reasons that you can neither predict nor prevent.
This isn't like exception handling, where C++ specifically requires destructors to be called, and you have the ability to catch exceptions and do special cleanup. You're talking about executing one piece of code, then suddenly inserting a call to some random function in the middle of already compiled code. That's not going to work.
If you want to be able to change the "task" of a thread, then you need to build that thread with "tasks" in mind. It needs to check every so often that it hasn't been given a new task, and if it has, then it switches to doing that. You will have to define when this switching is done, and what state the world is in when switching happens.
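A minimal sketch of a thread built with tasks in mind, using C++11 `std::thread` and `std::atomic` rather than Boost. `SwitchableWorker` and its two placeholder tasks are hypothetical. The worker does its task in small steps and checks between steps which task is currently requested, so a switch takes effect at a well-defined point.

```cpp
#include <atomic>
#include <thread>

class SwitchableWorker {
public:
    enum Task { Idle = 0, TaskA = 1, TaskB = 2, Quit = 3 };

    SwitchableWorker()
        : requested_(Idle), steps_a_(0), steps_b_(0),
          thread_(&SwitchableWorker::loop, this) {}

    // Asks the worker to stop at its next check, then waits for it.
    ~SwitchableWorker() {
        requested_.store(Quit);
        thread_.join();
    }

    // Takes effect the next time the worker checks, i.e. between two steps.
    void assign(Task task) { requested_.store(task); }

    long steps_a() const { return steps_a_.load(); }
    long steps_b() const { return steps_b_.load(); }

private:
    void loop() {
        for (;;) {
            // The switching point: one check per step of work.
            switch (requested_.load()) {
                case TaskA: ++steps_a_; break;  // one small step of task A
                case TaskB: ++steps_b_; break;  // one small step of task B
                case Quit:  return;
                default:    std::this_thread::yield(); break;
            }
        }
    }

    std::atomic<int>  requested_;
    std::atomic<long> steps_a_;
    std::atomic<long> steps_b_;
    std::thread       thread_;  // declared last so it starts last
};
```

The granularity of a step decides how quickly a switch is noticed. A step that can block forever brings back the original problem.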
In my code the main loop looks like the following
while ( (data = foo()) != NULL ) {
// do processing on data here
}
where foo() is written in C (it fetches the next frame in a video stream, using libavcodec, if you're curious).
My problem is that due to reasons too complicated to go in here, sometimes foo() hangs, which stops the whole program. What I want to do is to detect this condition, i.e. foo() is taking more than N seconds and if this is so take action.
I thought of creating a separate thread to run foo() to implement this, but I haven't done any multithreaded programming before. Here's what I want to do:
Main thread creates a child thread, which calls foo()
When foo() is done, the child thread returns
Main thread processes data returned by foo()
If the child takes more than a specified amount of time, an action is taken by the main thread.
Steps 1-4 are repeated as long as foo() doesn't return null, which signals the end.
How do I go about doing this? Do I need three threads (main, to run foo() and for timing)?
Thanks!
This is exceedingly difficult to do well. The problem is what you're going to do when foo hangs. Nearly the only thing you can do at that point is abort the program (not just the thread) and start over -- killing the thread and attempting to re-start it might work, but it's dangerous at best. The OS will clean up resources when you kill a process, but not when you kill a single thread. It's essentially impossible to figure out what resources belong exclusively to that thread, and what might be shared with some other thread in the process.
That being the case, perhaps you could move the hanging-prone part to a separate process instead, and kill/restart that process when/if it hangs? You'd then send the data to the parent process via some normal form of IPC (e.g., a pipe). In this case, you could have two threads in the parent (processor and watchdog), or (if available) you could do some sort of asynchronous read with time out, and kill the child when/if the read times out (using only one thread).
How do I go about doing this?
You don't. The hard thing is that there is no reliable way to stop a thread. Assuming the hang is in libavcodec, interrupting or killing a thread stuck in code you do not control leads to more problems than it solves (it might just be memory and file handle leaks if you're not too unlucky). The thread has to stop itself, but that's not an option if you're stuck inside libavcodec.
Many threading implementations don't let you kill threads either. You might request that the thread cancel, but if it's stuck in an infinite loop it will never do so, as cancel requests are processed only at certain boundary points in the OS or in low-level library calls.
To work around a buggy library like that in a reliable way, you need process isolation. What you do is create a separate program out of your foo() function, execute it, and communicate with it using its stdin/stdout streams, or some other form of IPC. When talking to an external program, you have various options for doing I/O with timeouts, and you can kill the program when you determine it's hanging.
On Linux you can use pthread_timedjoin_np to make this happen with two threads really easily.
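A sketch of the timed join, assuming Linux with glibc, since `pthread_timedjoin_np` is a non-portable GNU extension. `join_with_timeout` and `slow_frame_fetch` are hypothetical names; the latter merely sleeps to stand in for a slow foo(). Note that this only detects that the thread is late. The thread itself keeps running.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for pthread_timedjoin_np
#endif
#include <pthread.h>
#include <time.h>
#include <cerrno>

// Hypothetical stand-in for a slow foo(): sleeps for *arg milliseconds.
extern "C" void* slow_frame_fetch(void* arg) {
    long delay_ms = *static_cast<long*>(arg);
    struct timespec ts;
    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (delay_ms % 1000) * 1000000L;
    nanosleep(&ts, nullptr);
    return nullptr;
}

// Waits up to timeout_ms for `thread` to finish. Returns 0 once joined,
// ETIMEDOUT if the thread is still running when the deadline passes.
int join_with_timeout(pthread_t thread, long timeout_ms, void** retval) {
    // The deadline is an absolute CLOCK_REALTIME time, not a duration.
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    return pthread_timedjoin_np(thread, retval, &deadline);
}
```

After an `ETIMEDOUT` the thread is still joinable, so the main thread can try again later or decide to give up on it.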
I think you can do this with two threads and use the sleep() command in the main thread for the timing part as long as you don't need to do other work there.
You'd probably be better off just fixing whatever is hanging your application.
I'm looking for a way to restart a thread, either from inside that thread's context or from outside the thread, possibly from within another process. (Any of these options will work.) I am aware of the difficulty of hibernating entire processes, and I'm pretty sure that those same difficulties attend to threads. However, I'm asking anyway in the hopes that someone has some insight.
My goal is to pause, save to file, and restart a running thread from its exact context with no modification to that thread's code, or rather, modification in only a small area - i.e., I can't go writing serialization functions throughout the code. The main block of code must be unmodified, and will not have any global/system handles (file handles, sockets, mutexes, etc.) Really down-and-dirty details like CPU registers do not need to be saved; but basically the heap, stack, and program counter should be saved, and anything else required to get the thread running again logically correctly from its save point. The resulting state of the program should be no different, if it was saved or not.
This is for a debugging program for high-reliability software; the goal is to run simulations of the software with various scripts for input, and be able to pause a running simulation and then restart it again later - or get the sim to a branch point, save it, make lots of copies and then run further simulations from the common starting point. This is why the main program cannot be modified.
The main thread language is in C++, and should run on Windows and Linux, however if there is a way to only do this on one system, then that's acceptable too.
Thanks in advance.
I think what you're asking is much more complicated than you think. I am not too familiar with Windows programming but here are some of the difficulties you'll face in Linux.
A saved thread can only be restored from the root process that originally spawned the thread; otherwise the dynamic libraries would be broken, because dynamic libraries are loaded at different addresses each time they're loaded. Because of this, saving to disk is essentially meaningless. The only way around this would be to take complete control of dynamic linking, no small feat. It's possible, but pretty scary.
The suspended thread will have variables in the heap. You'd need to be able to find all globals 'owned' by the thread. The 'owned' state of any piece of the heap cannot be determined. In the future it may be possible with the C++0x's garbage collection ABI. You can't just assume the whole heap belongs to the thread to be paused. The main thread uses the heap when creating threads. So blowing away the heap when deserializing the paused thread would break the main thread.
You need to address the issues with globals. And not just the globals created in the threads. Globals (or statics) can be, and often are, created in dynamic libraries.
There are more resources to a program than just memory. You have file handles, network sockets, database connections, etc. A file handle is just a number. Serializing its memory is completely meaningless without the context of the process the file was opened in.
All that said, I don't think the core problem is impossible, just that you should consider a different approach.
Anyway, to try to implement this, the thread to be paused needs to be in a known state. I imagine the thread to be stopped would call a library function meant to halt the process so that it could be resumed.
I think the Linux system call fork is your friend. Fork duplicates the calling process (only the calling thread continues in the copy). Have the system run to the desired point and fork. The first fork waits so that it can fork others. The second fork runs one set of input.
Once it completes, the first fork can fork again. Again, the second fork can run another set of input.
Continue ad infinitum.
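The fork-per-branch idea can be sketched as follows, assuming POSIX and a single-threaded simulation. `Simulation` and `run_branches` are hypothetical names, and the one-line `advance` stands in for the real simulation step. Each child continues from a private copy of the state at the branch point and reports its result over a pipe, while the parent's state stays untouched.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Hypothetical simulation state with a trivial update rule.
struct Simulation {
    int value;
    void advance(int input) { value = value * 10 + input; }
};

// Forks once per input from the current state of `sim`. Returns one result
// per input, in order; -1 marks a branch that failed or crashed.
std::vector<int> run_branches(Simulation& sim, const std::vector<int>& inputs) {
    std::vector<int> results;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        int fds[2];
        if (pipe(fds) != 0) { results.push_back(-1); continue; }
        pid_t pid = fork();
        if (pid < 0) {
            close(fds[0]); close(fds[1]);
            results.push_back(-1);
            continue;
        }
        if (pid == 0) {
            // Child: this mutates the child's copy of sim only.
            close(fds[0]);
            sim.advance(inputs[i]);
            ssize_t n = write(fds[1], &sim.value, sizeof sim.value);
            _exit(n == static_cast<ssize_t>(sizeof sim.value) ? 0 : 1);
        }
        // Parent: collect the branch result, then reap the child.
        close(fds[1]);
        int result = -1;
        if (read(fds[0], &result, sizeof result) !=
            static_cast<ssize_t>(sizeof result)) result = -1;
        close(fds[0]);
        waitpid(pid, nullptr, 0);
        results.push_back(result);
    }
    return results;
}
```

The branches run one after another here. Starting all children first and collecting afterwards would run them in parallel.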
Threads run in the context of a process. So if you want to do anything like persist a thread state to disk, you need to "hibernate" the entire process.
You will need to serialise the entire set of the process's data. And you'll need to store the current thread execution point. I think serialising the process is do-able (check out Boost.Serialization) but the thread stop point is a lot more difficult. I would put places where it can be stopped through the code, but as you say, you cannot modify the code.
Given that problem, you're looking at virtualising the platform the app is running on, and using its suspend functionality to pause the entire thing. You might find more information about how to do this in the virtualisation vendor's features, eg Xen.
As the whole logical address space of the program is part of the thread's context, you would have to hibernate the whole process.
If you can guarantee that the thread only uses local variables, you could save its stack. It is easy to suspend a thread with pthreads, but I don't see how you could access its stack from outside then.
The way you would have to do this is via VM snapshots; get a copy of VMware Workstation, then you can write code to automate starting/stopping/snapshotting the machine at different points. Any other approach is pretty untenable: while you might be able to freeze and thaw a process, you can't reconstruct the system state it expects (all the stuff that Caspin mentions, like file handles et al.).