I am working on a CF9 application, and we have a process that is sometimes required that calls applicationStop()
My understanding of this function is that it shuts down the app, and the app is restarted on next request.
I am using this to reload some application-scoped variables and third-party stuff. However, it appears that the onApplicationStart() function is being called before the onApplicationEnd() function has finished processing.
Can anyone confirm whether this is intended functionality of the applicationStop() function? The problem it is causing me is that in onApplicationEnd() I am resetting some application-scoped stuff which I want to be re-initialised in onApplicationStart(), but if it does not wait until onApplicationEnd() finishes then I get into an inconsistent state.
EDIT
The question was originally really more about whether or not it was expected behaviour that onApplicationStart() was being called whilst onApplicationEnd() was still executing, and I was going to fix my problem with using a lock in onApplicationEnd() to make sure it finished its reloading.
However, I have added a lock:
lock scope="application" type="exclusive" timeout="5" {
And it just seems to ignore the whole block: none of the code (logging included) is executed, and no exceptions are thrown (exceptions are thrown if a lock times out before executing, right?). I assume this is related to the fact that we don't really have the complete application scope in onApplicationEnd()?
I didn't know the answer for this, so I researched it moderately thoroughly and published my findings.
The bottom line is you should expect it because it does happen. I don't think this is a bug because the onApplicationStart() / onApplicationEnd() methods are event handlers, and not the events themselves, so that one takes a while to run and the other might also be called in the interim is completely legit.
However I think in reality they should be synchronised, as it's not going to be desirable for the code from each to be running concurrently.
I didn't check the locking thing, but might have a look after that coffee I mentioned in my blog article.
Related
I have a device which has a library. Some of its functions are most awesomely ill-behaved, in the "occasionally hang forever" sense.
I have a program which uses this device. If/when it hangs, I need to be able to recover gracefully and reset it. The offending calls should return within milliseconds and are being called in a loop many many times per second.
My first question is: when a thread running the recalcitrant function hangs, what do I do? Even if I litter the thread with interruption points, this happens:
boost::this_thread::interruption_point(); // irrelevant, in the past
deviceLibrary.thatFunction(); // <-- hangs here forever
boost::this_thread::interruption_point(); // never gets here!
The only word I've read on what to do there is to modify the function itself, but that's out of the question for a variety of reasons -- not least of which is "this is already miles outside of my skill set".
I have tried asynchronous launching with C++11 futures:
// this was in a looping thread -- it does not work: wait_for sometimes never returns
std::future<void> future = std::async(std::launch::async,
    [this] () { deviceLibrary.thatFunction(*data_ptr); });
if (future.wait_for(std::chrono::seconds(timeout)) == std::future_status::timeout) {
    printf("no one will ever read this\n");
    deviceLibrary.reset(); // this would work if it ever got here
}
No dice, in that or a number of variations.
I am now trying boost::asio with a thread_group of a number of worker threads running io_service::run(). It works magnificently until the second time it times out. Then I've run out of threads, because each hanging thread eats up one of my thread_group and it never comes back ever.
My latest idea is to call work_threads.create_thread to make a new thread to replace the now-hanging one. So my second question is: if this is a viable way of dealing with this, how should I cope with the slowly amassing group of hung threads? How do I remove them? Is it fine to leave them there?
Incidentally, I should mention that there is in fact a version of deviceLibrary.thatFunction() that has a timeout. It doesn't.
I found this answer but it's C# and Windows specific, and this one which seems relevant. But I'm not so sure about spawning hundreds of extra processes a second (edit: oh right; I could banish all the calls to one or two separate processes. If they communicate well enough and I can share the device between them. Hm...)
Pertinent background information: I'm using MSVC 2013 on Windows 7, but the code has to cross-compile for ARM on Debian with GCC 4.6 also. My level of C++ knowledge is... well... if it seems like I'm missing something obvious, I probably am.
Thanks!
If you want to reliably kill something that's out of your control and may hang, use a separate process.
While process isolation was once considered to be very 'heavy-handed', browsers like Chrome today will implement it on a per-tab basis. Each tab gets a process, the GUI has a process, and if the tab rendering dies it doesn't take down the whole browser.
How can Google Chrome isolate tabs into separate processes while looking like a single application?
Threads are simply not designed for letting a codebase defend itself from ill-behaved libraries. Processes are.
So define the services you need, put that all in one program using your flaky libraries, and use interprocess communication from your main app to speak with the bridge. If the bridge times out or has a problem due to the flakiness, kill it and restart it.
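A minimal sketch of the "kill it and restart it" part, assuming a POSIX target (the Debian/ARM build from the question; on Windows the equivalent would go through CreateProcess and TerminateProcess instead). The helper name `run_in_child_with_timeout` and the 5 ms polling interval are made up for illustration, and a real bridge would be a long-lived process that you talk to over IPC rather than one fork per call.

```cpp
#include <chrono>
#include <functional>
#include <thread>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs `work` in a child process.
// Returns true if the child exited with status 0 within `timeout`;
// otherwise kills the child (if still running) and returns false.
bool run_in_child_with_timeout(const std::function<void()>& work,
                               std::chrono::milliseconds timeout) {
    pid_t pid = fork();
    if (pid < 0) return false;            // fork failed
    if (pid == 0) {                       // child: make the risky call, then exit
        work();
        _exit(0);
    }

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    while (std::chrono::steady_clock::now() < deadline) {
        pid_t done = waitpid(pid, &status, WNOHANG);   // non-blocking check
        if (done == pid) return WIFEXITED(status) && WEXITSTATUS(status) == 0;
        if (done < 0) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }

    kill(pid, SIGKILL);                   // hung: the OS reclaims whatever it held
    waitpid(pid, &status, 0);             // reap it so no zombie is left behind
    return false;
}
```

Unlike a hung thread, a killed child leaves nothing behind in the parent, so the parent can simply start a fresh bridge and carry on.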
I am only going to answer this part of your text:
when a thread running the recalcitrant function hangs, what do I do?
A thread could invoke inline machine instructions.
These instructions might clear the interrupt flag.
This may cause the code to be non interruptible.
As long as it does not decide to return, you cannot force it to return.
You might be able to force it to die (eg kill the process containing the thread), but you cannot force the code to return.
I hope my answer convinces you that the answer recommending to use a bridge process is in fact what you should do.
The first thing you do is make sure that it's the library that's buggy. Then you create a minimal example that demonstrates the problem (if possible), and send a bug report and the example to the library's developer. Lastly, you cross your fingers and wait.
What you don't do is put your fingers in your ears and say "LALALALALA" while you hide the problem behind layers of crud in an attempt to pretend the problem is gone.
So, this is the problem:
I have written a wrapper class exposing simplified API for the libtorrent c++ library. It (the wrapper) has a stack-allocated member, which is libtorrent's main session object.
The library itself uses boost framework, and its threading features - it is multithreaded. (I must say that I'm not really familiar with boost.)
Now, I wanted to create a simple MFC dialog-based application that will have a couple of buttons for managing the session, progress bar, etc.
The destructor of a libtorrent session may take a while to finish (since it needs to notify the trackers that it's closing). The user is prompted on exit with a MessageBox to confirm download termination, so I thought it was a good idea to put my wrapper object as a member of the app class, rather than the CDialog (the wrapper destructor, and consequently the session's will kick in after the dialog is closed). Libtorrent docs also state that it is a good idea to close UI such as windows before the destructor is invoked.
And here comes the fun part - everything works fine, until I close the dialog. The process continues to live for a couple of seconds, and then crashes with some boost-related locks/critical section stuff (that's where the debugger pointed, some lock / release call in one of the boost's headers)...
EDIT
Seems that while closing, some thread checks are performed from the main window, and it gets into some "irregular" state where it does something that makes the boost fail. I'm thinking some kind of a "join" is needed for the gui thread, to wait for other threads termination...
If anyone understood what I was trying to explain here, and has some idea what I am doing wrong, or has an alternative solution to this concept, I'd really appreciate it.
Thanks.
You can wait for the Boost threads to join prior to exiting. I have an Output_Processor class that uses a Boost thread. I interface to it through a queue. Once I want to shut down the app, I put a shutdown command in its queue. The Output_Processor thread returns after processing that command. Then my block on join returns and the rest of the app can shut down gracefully.
...
_output_processor_queue->write(shutdown_command);
// Wait for output processor thread to join.
_output_processor_thread->join();
_output_processor_initialized = false;
...
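For readers who want the whole shape rather than the excerpt, here is a rough, self-contained sketch of the same idea using std::thread instead of Boost. The class `OutputProcessor`, its `Command` struct and the `processed()` accessor are invented for this example and are not the classes from the code above.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical worker that is fed through a queue and stopped by a
// shutdown command travelling through that same queue.
class OutputProcessor {
public:
    OutputProcessor() : worker_([this] { run(); }) {}
    ~OutputProcessor() { shutdown(); }

    void write(std::string line) { push({false, std::move(line)}); }

    // Queue the shutdown command, then block until the worker has handled
    // everything queued before it and returned.
    void shutdown() {
        if (!worker_.joinable()) return;   // already shut down
        push({true, {}});
        worker_.join();
    }

    // Only safe to call after shutdown() has returned.
    const std::vector<std::string>& processed() const { return processed_; }

private:
    struct Command { bool stop; std::string text; };

    void push(Command c) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(c));
        }
        cv_.notify_one();
    }

    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            Command c = std::move(q_.front());
            q_.pop();
            lock.unlock();
            if (c.stop) return;            // shutdown command: leave the thread
            processed_.push_back(std::move(c.text));
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Command> q_;
    std::vector<std::string> processed_;
    std::thread worker_;   // declared last so the members above exist before it starts
};
```

Because the shutdown command sits behind any pending work in the queue, everything written before it is still processed before join() returns.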
OK, the problem is resolved.
All I did is that I initially created a dynamic wrapper object, and deleted it after doModal() returns. At that point the main thread blocks, waiting till the deletion operation is over, which is basically until the libtorrent session is destructed. However, the peculiar behavior of non-dynamic object remains.
I have a very large, complex (million+ LOC) Windows application written in C++. We receive a handful of reports every day that the application has locked up, and must be forcefully shut down.
While we have extensive reporting about crashes in place, I would like to expand this to include these hang scenarios -- even with heavy logging in place, we have not been able to track down root causes for some of these. We can clearly see where activity stopped - but not why it stopped, even in evaluating output of all threads.
The problem is detecting when a hang occurs. So far, the best I can come up with is a watchdog thread (as we have evidence that background threads are continuing to run w/out issues) which periodically pings the main window with a custom message, and confirms that it is handled in a timely fashion. This would only capture GUI thread hangs, but this does seem to be where the majority of them are occurring. If a reply was not received within a configurable time frame, we would capture a memory and stack dump, and give the user the option of continuing to wait or restarting the app.
Does anyone know of a better way to do this than such a periodic polling of the main window in this way? It seems painfully clumsy, but I have not seen alternatives that will work on our platforms -- Windows XP, and Windows 2003 Server. I see that Vista has much better tools for this, but unfortunately that won't help us.
Suffice it to say that we have done extensive diagnostics on this and have been met with only limited success. Note that attaching windbg in real-time is not an option, as we don't get the reports until hours or days after the incident. We would be able to retrieve a memory dump and log files, but nothing more.
Any suggestions beyond what I'm planning above would be appreciated.
The answer is simple: SendMessageTimeout!
Using this API you can send a message to a window and wait up to a timeout for it to be handled before continuing. If the application responds before the timeout, it is still running; otherwise it is hung.
One option is to run your program under your own "debugger" all the time. Some programs, such as GetRight, do this for copy protection, but you can also do it to detect hangs. Essentially, you include in your program some code to attach to a process via the debugging API and then use that API to periodically check for hangs. When the program first starts, it checks if there's a debugger attached to it and, if not, it runs another copy of itself and attaches to it - so the first instance does nothing but act as the debugger and the second instance is the "real" one.
How you actually check for hangs is another whole question, but having access to the debugging API there should be some way to check reasonably efficiently whether the stack has changed or not (ie. without loading all the symbols). Still, you might only need to do this every few minutes or so, so even if it's not efficient it might be OK.
It's a somewhat extreme solution, but should be effective. It would also be quite easy to turn this behaviour on and off - a command-line switch will do or a #define if you prefer. I'm sure there's some code out there that does things like this already, so you probably don't have to do it from scratch.
A suggestion:
Assuming that the problem is due to locking, you could dump your mutex & semaphore states from a watchdog thread. With a little bit of work (tracing your call graph), you can determine how you've arrived at a deadlock, which call paths are mutually blocking, etc.
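As a rough illustration of what "dumping mutex states" could look like, here is a small wrapper that records whether a lock is currently held so a watchdog can log it. `TrackedMutex` and `dump_lock_states` are hypothetical names, and a real version would probably also record the owning thread and the call site.

```cpp
#include <atomic>
#include <mutex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mutex wrapper that remembers whether it is currently held.
// It has lock()/unlock(), so it works with std::lock_guard.
class TrackedMutex {
public:
    explicit TrackedMutex(std::string name) : name_(std::move(name)) {}

    void lock()   { m_.lock(); held_.store(true); }
    void unlock() { held_.store(false); m_.unlock(); }

    bool held() const { return held_.load(); }
    const std::string& name() const { return name_; }

private:
    std::string name_;
    std::mutex m_;
    std::atomic<bool> held_{false};
};

// What a watchdog thread might write to the log when it suspects a hang:
// one "name=held" or "name=free" line per registered lock.
std::string dump_lock_states(const std::vector<const TrackedMutex*>& locks) {
    std::ostringstream out;
    for (const TrackedMutex* l : locks)
        out << l->name() << '=' << (l->held() ? "held" : "free") << '\n';
    return out.str();
}
```

The dump only reads an atomic flag, so the watchdog never has to take any of the locks it is reporting on.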
While a crashdump analysis seems to provide a solution for identifying the problem, in my experience this rarely bears much fruit since it lacks sufficient unambiguous detail of what happened just before the crash. Even with the tool you propose, it would provide little more than circumstantial evidence of what happened. I bet the cause is unprotected shared data, so a lock trace wouldn't show it.
The most productive way of finding this—in my experience—is distilling the application's logic to its essence and identifying where conflicts must be occurring. How many threads are there? How many are GUI? At how many points do the threads interact? Yep, this is good old desk checking. Leading suspect interactions can be identified in a day or two, then just convince a small group of skeptics that the interaction is correct.
I have used a version of double checked locking in my CF app (before I knew what double checked locking was).
Essentially, I check for the existence of an object. If it is not present, I lock (usually using a named lock), and before I try to create the object I check for existence again. I thought this was a neat way to stop multiple objects being created and stop excessive locking in the system.
This seems to work, in that there is no excessive locking and duplicate objects don't get created. However, I have recently learned that double-checked locking doesn't work in Java. What I don't know is whether this holds true in CF, seeing as CF threads and locks are not quite the same as native Java threads and locks.
To add on to what Ben Doom said about Java, this is fairly standard practice in ColdFusion, specifically with an application initialization routine where you set up your application variables.
Without having at least one lock, you are letting the initial hits to your web application all initialize the application variables at the same time. This assumes that your application is busy enough to warrant this. The danger is only there if your application is busy at the time your application is first starting up.
The first lock makes sure only one request at a time initializes your variables.
The second lock, embedded within the first, will check to make sure a variable defined at the end of your initialization code exists, such as application.started. If it exists, the user is kicked out.
The double-locking pattern has saved my skin on busy sites, however, with VERY busy sites, the queue of requests for the application's initial hit to complete can climb too high, too quickly, and cause the server to crash. The idea is, the requests are waiting for the first hit, which is slow, then the second one breaks into the first cflock, and is quickly rejected. With hundreds or thousands of requests in the queue, growing every millisecond, they are all funneling down to the first cflock block. The solution is to set a very low timeout on the first cflock and not throw (or catch and duck) the lock timeout error.
As a final note, this behavior that I described has been deprecated with ColdFusion 7's onApplicationStart() method of your Application.cfc. If you are using onApplicationStart(), then you shouldn't be locking at all for your application init routine. Application.cfc is well locked already.
To conclude, yes, double-checked locking works in ColdFusion. It's helpful in a few certain circumstances, but do it right. I don't know the schematics of why it works as opposed to Java's threading model, chances are it's manually checking some sort of lookup table in the background of your ColdFusion server.
Java is threadsafe, so it isn't so much that your locks won't work as that they aren't necessary. Basically, in CF 6+, locks are needed for preventing race conditions or creating/altering objects that exist outside Java's control (files, for example).
To open a whole other can of worms...
Why don't you use a Dependency Injection library, such as ColdSpring, to keep track of your objects and prevent circular dependencies?
I've got a C++ Win32 application that has a number of threads that might be busy doing IO (HTTP calls, etc) when the user wants to shutdown the application. Currently, I play nicely and wait for all the threads to end before returning from main. Sometimes, this takes longer than I would like and indeed, it seems kind of pointless to make the user wait when I could just exit. However, if I just go ahead and return from main, I'm likely to get crashes as destructors start getting called while there are still threads using the objects.
So, recognizing that in an ideal, platonic world of virtue, the best thing to do would be to wait for all the threads to exit and then shutdown cleanly, what is the next best real world solution? Simply making the threads exit faster may not be an option. The goal is to get the process dead as quickly as possible so that, for example, a new version can be installed over it. The only disk IO I'm doing is in a transactional db, so I'm not terribly concerned about pulling the plug on that.
Use overlapped IO so that you're always in control of the threads that are dealing with your I/O and can always stop them at any point; you either have them waiting on an IOCP and can post an application level shutdown code to it, OR you can wait on the event in your OVERLAPPED structure AND wait on your 'all threads please shutdown now' event as well.
In summary, avoid blocking calls that you can't cancel.
If you can't and you're stuck in a blocking socket call doing IO then you could always just close the socket from the thread that has decided that it's time to shut down and have the thread that's doing IO always check the 'shutdown now' event before retrying...
I use an exception-based technique that's worked pretty well for me in a number of Win32 applications.
To terminate a thread, I use QueueUserAPC() to queue a call to a function which throws an exception. However, the exception that's thrown isn't derived from the type "Exception", so will only be caught by my thread's wrapper procedure.
The advantages of this are as follows:
No special code needed in your thread to make it 'stoppable' - as soon as it enters an alertable wait state, it will run the APC function.
All destructors get invoked as the exception runs up the stack, so your thread exits cleanly.
The things you need to watch for:
Anything doing catch (...) will eat your exception. User code should always use catch(const Exception &e) or similar!
Make sure your I/O and delays are done in an "alertable" way. For example, this means calling SleepEx(N, TRUE) instead of Sleep(N).
CPU-bound threads need to call SleepEx(0, TRUE) occasionally to check for termination.
You can also 'protect' areas of your code to prevent task termination during critical sections.
Best way: Do your work while the app is running, and do nothing (or as close to) at shutdown (works for startup too). If you stick to that pattern, then you can tear down the threads immediately (rather than "being nice" about it) when the shutdown request comes without worrying about work that still needs to be done.
In your specific situation, you'd probably need to wait for IO to finish (writes, at least) if you're doing local work there. HTTP requests and such you can probably just abandon/close outright (again, unless you're writing something). But if it is the case that you're writing during this shutdown and waiting on that, then you may want to notify the user of that, rather than letting your process look hung while you're wrapping things up.
I'd recommend having your GUI and work be done on different threads. When a user requests a shutdown, dismiss the GUI immediately giving the appearance that the application has closed. Allow the worker threads to close gracefully in the background.
If you want to pull the plug messily, exit(0) will do the trick.
I once had a similar problem, albeit in Visual Basic 6: threads from an app would connect to different servers, download some data, perform some operations looping upon that data, and store on a centralized server the result.
Then, new requirement was that threads should be stoppable from main form. I accomplished this in an easy though dirty fashion, by having the threads stop after N loops (equivalent roughly to half a second) to try to open a mutex with a specific name. Upon success, they immediately stopped whatever they were doing and quit, continued otherwise.
This mutex was created only by the main form, once it was created all the threads would soon close themselves. The disadvantage was that user needed to manually specify it wanted to run the threads again - another button to "Enable threads to run" accomplished this by releasing the mutex :D
This trick is guaranteed to work because mutex operations are atomic. The problem is that you're never sure a thread really closed: a failure in the logic of handling the "openMutex succeeded" case could mean it never ends. You also don't know when/if all the threads have closed (assuming your code is right, this would take roughly the same time it takes for the loops to stop and "listen").
With VB's "apartment" model of multi-threading it's somewhat difficult to send info from the threads to the main app back and forth, it's much easier to "fire and forget" or to send it only from the main app to the thread. Thus, the need of these kind of long-cuts. Using C++ you're free to use your multi-threading model, so these constraints might not apply to you.
Whatever you do, do NOT use TerminateThread, especially on anything that could be in OS HTTP calls. You could potentially break IE until reboot.
Change all of your IO to an asynchronous or non-blocking model so that they can watch for termination events.
If you need to shutdown suddenly: Just call ExitProcess - which is what is going to be called just as soon as you return from WinMain anyway. Windows itself creates many worker threads that have no way to be cleaned up - they are terminated by process shutdown.
If you have any threads that are performing writes of some kind - obviously those need a chance to close their resources. But anything else - ignore the bounds checker warnings and just pull the rug from under their feet.
You can call TerminateProcess - this will stop the process immediately, without notifying anyone and without waiting for anything.
*NULL = 0 is the fastest way. If you don't want to crash, call exit() or its Win32 equivalent.
Instruct the user to unplug the computer. Short of that, you have to abandon your asynchronous activities to the wind. Or is that HWIND? I can never remember in C++. Of course, you could take the middle road and quickly note in a text file or reg key what action was abandoned so that the next time the program runs it can take up that action again automatically or ask the user if they want to do so. Depending on what data you lose when you abandon the asynch action, you may not be able to do that. If you're interacting with the user, you may want to consider a dialog or some UI interaction that explains why it's taking so long.
Personally, I prefer the instruction to the user to just unplug the computer. :)