Detecting and recovering from Windows TDR? - c++

I've run into an odd issue with some OpenCL code that I'm working on where every once in a blue moon, Windows TDR will kick in and reset the GPU. The offending kernel runs for only 150ms and will run thousands of times (over the course of many hours) before the TDR kills it off, so I'm certain that the kernel itself isn't to blame.
My concern is that once the TDR kicks in, the kernel dies and the program is stuck in an eternal state of limbo. From what I can tell the call to clFinish never returns.
Is there a way to detect if a kernel has died off so that it can be handled gracefully?

I managed to come up with a solution, although it's far from optimal.
I've modified the program so that the OpenCL processing is done in a separate thread. I created a global watchdog variable shared between the parent thread and the processing thread. When the parent spawns the processing function as a thread, it sets the variable to the current time in milliseconds. When the processing thread finishes, it resets the watchdog variable to zero.
While the parent thread waits for the processing thread to finish, it keeps an eye on the watchdog timer. If the timer exceeds a certain threshold, the program forcefully terminates itself without waiting for the processing thread to return.
This solution works with or without Windows TDR set. If TDR is set and the driver resets, the call to clFinish() will never return and the parent will terminate once the watchdog timer trips. If TDR is not set, the runaway process will freeze the display, but once the watchdog timer trips, the parent will terminate processing, ending the freeze.
Now that I have a watchdog set up, I simply wrapped my program in a script: if it terminated in error (positive return code) then the program is rerun.
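A rough sketch of this watchdog arrangement (the threshold and the runKernelAndWait helper are illustrative placeholders, not the actual code):
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>

std::atomic<long long> g_watchdogMs{0};   // 0 means "no kernel in flight"

static long long nowMs() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

void runKernelAndWait() { /* hypothetical: enqueue the OpenCL kernel, then clFinish() */ }

void processingThread() {
    runKernelAndWait();   // may never return if TDR resets the driver
    g_watchdogMs = 0;     // disarm the watchdog on normal completion
}

int main() {
    g_watchdogMs = nowMs();               // parent arms the watchdog before spawning
    std::thread worker(processingThread);
    const long long thresholdMs = 5000;   // illustrative threshold
    while (g_watchdogMs != 0) {
        if (nowMs() - g_watchdogMs > thresholdMs)
            std::exit(1);                 // positive exit code: the wrapper script reruns us
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    worker.join();
    return 0;
}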

Ideally, you should get an error code from clFinish or clWaitForEvents with the OpenCL event object generated when enqueuing the kernel. Since TDR resets the graphics driver, I don't think any OpenCL implementation will work reliably, meaning there is no recovery route.
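In sketch form, that check would look something like this (queue, kernel and globalSize are assumed to be set up elsewhere):
#include <CL/cl.h>

cl_int runAndWait(cl_command_queue queue, cl_kernel kernel, size_t globalSize) {
    cl_event evt;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize,
                                        NULL, 0, NULL, &evt);
    if (err != CL_SUCCESS)
        return err;
    err = clWaitForEvents(1, &evt);   // or clFinish(queue); ideally this reports the failure
    clReleaseEvent(evt);
    return err;                       // in practice the call may simply never return after a TDR
}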
Instead, consider disabling TDR completely. It is only really worthwhile while you are debugging code that gets stuck in an infinite loop and permanently keeps the GPU busy.
If you want to keep TDR but can change the code, then using some sort of thread sleep function to delay your code for a few milliseconds could also alleviate this problem, at the expense of sacrificing processing speed. This gives the graphics card a chance to respond to display rendering commands so that TDR is not triggered.

Related

pthread wait() takes way longer than expected when the main window is hidden

We are having an issue when waiting in a thread on MacOS and the main window is hidden, the wait function takes up to 10 seconds even if we request it to wait 100ms.
The main program is running in a Cocoa window, and another thread runs permanently, waiting 100ms every iteration.
Everything works fine when the main window is visible, but once the window is hidden, the problem starts happening after some time, i.e. the wait starts taking several seconds. We suspect the system stops waking up the application as often because it's no longer visible.
We are using pthread_cond_timedwait, but the same problem happens using usleep or boost::sleep (which probably use the same mechanism underneath).
Is there a way to prevent this or a flag to set to tell the system we are still running and we want to be woken up?
Thanks
If it's OS X v10.9 or later, your app could be napping away (App Nap):
Power Efficiency Guide for Mac Apps
The document says it can be prevented with the NSProcessInfo class.
Managing Activities
The system has heuristics to improve battery life, performance, and responsiveness of applications for the benefit of the user. You can use the following methods to manage activities that give hints to the system that your application has special requirements:
beginActivityWithOptions:reason:
endActivity:
performActivityWithOptions:reason:usingBlock:
In response to creating an activity, the system will disable some or all of the heuristics so your application can finish quickly while still providing responsive behavior if the user needs it.
...
id activity = [[NSProcessInfo processInfo] beginActivityWithOptions:NSActivityLatencyCritical reason:@"Good Reason"];
// Perform some work.
[[NSProcessInfo processInfo] endActivity:activity];
Note,
NSActivityLatencyCritical
Flag to indicate the activity requires the highest amount of timer and I/O precision available.
IMPORTANT Very few applications should need to use this constant.

Why are threads interrupted even when atexit or the handler for SetConsoleCtrlHandler is executed?

I have a multithreaded application under Windows 7.
I need to correctly finish jobs in threads which have open descriptors, connections and so on when a user presses the 'X' in the corner of the command-line window, presses Ctrl+C, shuts down the OS and so on.
I've set a handler with SetConsoleCtrlHandler which sets appropriate flags for the other threads to finish their jobs correctly. But all of them are interrupted and they exit with code 0xc000013a. Sometimes even my handler doesn't have time to set the flag.
The same problem remains when I try to do the same operations in atexit handler.
Why are all threads stopped even during interruption handler? How can I avoid this and let all my threads finish their job?
sets appropriate flags for other threads to correctly finish their job
Usually that's not enough. You also must wait for the threads to finish (thread.join(), WaitForMultipleObjects, or something similar).
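A minimal sketch of that pattern (the flag, the handler name and the worker handles are illustrative):
#include <windows.h>
#include <atomic>

std::atomic<bool> g_stopRequested{false};
HANDLE g_workers[2];   // illustrative: handles of the worker threads

BOOL WINAPI ctrlHandler(DWORD ctrlType) {
    g_stopRequested = true;                             // tell the workers to wrap up
    WaitForMultipleObjects(2, g_workers, TRUE, 5000);   // wait for them before returning;
                                                        // note: on close/logoff/shutdown the
                                                        // system only grants a few seconds
    return TRUE;
}

// during startup:
// SetConsoleCtrlHandler(ctrlHandler, TRUE);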
The problem in my case was that some of the child threads used timed waits on system resources, so each of them needed to wake from its wait before it could be joined. And all of them were stopped consecutively, so they required too much time to stop.

SDL_PollEvent vs SDL_WaitEvent

So I was reading this article which contains 'Tips and Advice for Multithreaded Programming in SDL' - https://vilimpoc.org/research/portmonitorg/sdl-tips-and-tricks.html
It talks about SDL_PollEvent being inefficient as it can cause excessive CPU usage and so recommends using SDL_WaitEvent instead.
It shows an example of both loops, but I can't see how this would work with a game loop. Is it the case that SDL_WaitEvent should only be used by things which don't require constant updates, i.e. if you had a game running you would perform game logic each frame?
The only things I can think it could be used for are programs like a paint program where there is only action required on user input.
Am I correct in thinking I should continue to use SDL_PollEvent for generic game programming?
If your game only updates/repaints on user input, then you could use SDL_WaitEvent. However, most games have animation/physics going on even when there is no user input. So I think SDL_PollEvent would be best for most games.
One case in which SDL_WaitEvent might be useful is if you have it in one thread and your animation/logic on another thread. That way even if SDL_WaitEvent waits for a long time, your game will continue painting/updating. (EDIT: This may not actually work. See Henrik's comment below)
As for SDL_PollEvent using 100% CPU as the article indicated, you could mitigate that by adding a sleep in your loop when you detect that your game is running faster than the required frames per second.
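For instance, a poll-based loop with a simple frame limiter might look roughly like this (updateGame and renderFrame are hypothetical placeholders):
#include <SDL.h>

void updateGame() { /* hypothetical game logic */ }
void renderFrame() { /* hypothetical rendering */ }

void runMainLoop() {
    const Uint32 frameMs = 1000 / 60;    // illustrative ~60 FPS target
    bool running = true;
    while (running) {
        Uint32 start = SDL_GetTicks();
        SDL_Event e;
        while (SDL_PollEvent(&e)) {      // drain all pending events this frame
            if (e.type == SDL_QUIT) running = false;
        }
        updateGame();
        renderFrame();
        Uint32 elapsed = SDL_GetTicks() - start;
        if (elapsed < frameMs)
            SDL_Delay(frameMs - elapsed);   // hand the CPU back instead of spinning
    }
}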
If you don't need sub-frame precision in your input, and your game is constantly animating, then SDL_PollEvent is appropriate.
Sub-frame precision can be important for, e.g., games where the player might want very small increments in movement - quickly tapping and releasing a key has unpredictable behavior if you use the classic lazy method of keydown meaning "velocity = 1" and keyup meaning "velocity = 0" and then only update position once per frame. If your tap happens to overlap with the frame render then you get one frame-duration of movement, and if it does not you get no movement, where what you really want is an amount of movement smaller than the length of a frame, based on the timestamps at which the events occurred.
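The idea, in sketch form, is to integrate movement from the key event timestamps rather than from a per-frame flag (here using the timestamps SDL does attach to events; see the caveat in the next paragraph, and the names and speed are illustrative):
#include <SDL.h>

static Uint32 keyDownAt = 0;
static float positionX = 0.0f;
const float speedPerSec = 100.0f;   // illustrative movement speed, units per second

void handleKeyEvent(const SDL_Event *e) {
    if (e->type == SDL_KEYDOWN && e->key.keysym.sym == SDLK_RIGHT && !e->key.repeat) {
        keyDownAt = e->key.timestamp;                    // remember when the key went down
    } else if (e->type == SDL_KEYUP && e->key.keysym.sym == SDLK_RIGHT && keyDownAt != 0) {
        Uint32 heldMs = e->key.timestamp - keyDownAt;    // can be shorter than one frame
        positionX += speedPerSec * (heldMs / 1000.0f);
        keyDownAt = 0;
    }
    // (a key still held at render time would also need integrating up to "now" each frame)
}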
Unfortunately SDL's events don't include the actual event timestamps from the operating system, only the timestamp of the PumpEvents call, and WaitEvent effectively polls at 10ms intervals, so even with WaitEvent running in a separate thread, the most precision you'll get is 10ms (you could maybe approximate smaller by saying if you get a keydown and keyup in the same poll cycle then it's ~5ms).
So if you really want precision timing on your input, you might actually need to write your own version of SDL_WaitEventTimeout with a smaller SDL_Delay, and run that in a separate thread from your main game loop.
Further unfortunately, SDL_PumpEvents must be run on the thread that initialized the video subsystem (per https://wiki.libsdl.org/SDL_PumpEvents ), so the whole idea of running your input loop on another thread to get sub-frame timing is nixed by the SDL framework.
In conclusion, for SDL applications with animation there is no reason to use anything other than SDL_PollEvent. The best you can do for sub-framerate input precision is, if you have time to burn between frames, you have the option of being precise during that time, but then you'll get weird render-duration windows each frame where your input loses precision, so you end up with a different kind of inconsistency.
In general, you should use SDL_WaitEvent rather than SDL_PollEvent so that the CPU is released to the operating system to handle other tasks, like processing user input. Busy-polling can manifest to your users as a sluggish reaction to input, since it can cause a delay between when they enter a command and when your application processes the event. By using SDL_WaitEvent instead, the OS can post events to your application as soon as they occur, which improves the perceived performance.
As a side benefit, users on battery-powered systems, like laptops and portable devices, should see slightly less battery usage, since the OS has the opportunity to reduce overall CPU usage: your game isn't using the CPU 100% of the time, only when an event actually occurs.
This is a very late response, I know. But this is the thread that tops a Google search on this, so it seems the place to add an alternative suggestion to dealing with this that some might find useful.
You could write your code using SDL_WaitEvent, so that, when your application is not actively animating anything, it'll block and hand the CPU back to the OS.
But then you can send a user-defined message to the queue from another thread (e.g. the game logic thread) to wake up the main rendering thread with that message. It then goes through the loop to render a frame, swaps buffers, and returns to SDL_WaitEvent again, where another of these user-defined messages can be waiting to be picked up, telling it to loop once more.
This sort of structure might be good for an application (or game) where there's a "burst" of animation, but otherwise it's best for it to block and go idle (and save battery on laptops).
For example, a GUI where it animates when you open or close or move windows or hover over buttons, but it's otherwise static content most of the time.
(Or, for a game, though it's animating all the time in-game, it might not need to do that for the pause screen or the game menus. So, you could send the "SDL_ANIMATEEVENT" user-defined message during gameplay, but then, in the game menus and pause screen, just wait for mouse / keyboard events and actually allow the CPU to idle and cool down.)
Indeed, you could have self-triggering animation events. In that the rendering thread is woken up by a "SDL_ANIMATEEVENT" and then one more frame of animation is done. But because the animation is not complete, the rendering thread itself posts a "SDL_ANIMATEEVENT" to its own queue, that'll trigger it to wake up again, when it reaches SDL_WaitEvent.
And another idea there is that SDL events can carry data too. So you could supply, say, an animation ID in "data1" and a "current frame" counter in "data2" with the event. So that when the thread picks up the "SDL_ANIMATEEVENT", the event itself tells it which animation to do and what frame we're currently on.
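A sketch of that wake-up mechanism might look like this ("SDL_ANIMATEEVENT" is just a user-defined event code reserved with SDL_RegisterEvents, and requestAnimationFrame is an illustrative helper):
#include <SDL.h>
#include <cstdint>

Uint32 SDL_ANIMATEEVENT;   // assign once after SDL_Init: SDL_ANIMATEEVENT = SDL_RegisterEvents(1);

// called from the logic thread, or by the render loop itself to self-trigger the next frame
void requestAnimationFrame(int animationId, int frameNumber) {
    SDL_Event ev;
    SDL_zero(ev);
    ev.type = SDL_ANIMATEEVENT;
    ev.user.data1 = (void *)(intptr_t)animationId;   // which animation to advance
    ev.user.data2 = (void *)(intptr_t)frameNumber;   // which frame we're on
    SDL_PushEvent(&ev);                              // wakes up a thread blocked in SDL_WaitEvent
}

// main rendering thread:
// SDL_Event ev;
// while (SDL_WaitEvent(&ev)) {
//     if (ev.type == SDL_ANIMATEEVENT) { /* render one frame; requestAnimationFrame(...) again if not done */ }
//     else { /* handle input events */ }
// }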
This is a "best of both worlds" solution, I feel. It can behave like SDL_WaitEvent or SDL_PollEvent at the application's discretion by just sending messages to itself.
For a game, this might not be worth it, as you're updating frames constantly, so there's no big advantage to this and maybe it's not worth bothering with (though even games could benefit from going to 0% CPU usage in the pause screen or in-game menus, to let the CPU cool down and use less laptop battery).
But for something like a GUI - which has more "burst-y" animation - then a mouse event can trigger an animation (e.g. opening a new window, which zooms or slides into view) that sends "SDL_ANIMATEEVENT" back to the queue. And it keeps doing that until the animation is complete, then falls back to normal SDL_WaitEvent behaviour again.
It's an idea that might fit what some people need, so I thought I'd float it here for general consumption.
You could actually initialise SDL and the window in the main thread and then create two more threads: one for updates (which just updates game state and variables as time passes) and one for rendering (which renders the surfaces accordingly).
Then, after all that is done, use SDL_WaitEvent in your main thread to manage SDL_Events. This way you can ensure that events are managed in the same thread that called SDL_Init.
I have been using this method for a long time to make my games work on Windows and Linux, and have been able to successfully run the three threads mentioned above at the same time.
I had to use a mutex to make sure that textures/surfaces can also be transformed/changed in the update thread by pausing the render thread, and the lock is only taken once every 60 frames, so it's not going to cause major performance issues.
This model works best to create event driven games, run time games, or both.

Can I set a single thread's priority above 15 for a normal priority process?

I have a data acquisition application running on Windows 7, using VC2010 in C++. One thread is a heartbeat which sends out a change every .2 seconds to keep-alive some hardware which has a timeout of about .9 seconds. Typically the heartbeat call takes 10-20ms and the thread spends the rest of the time sleeping.
Occasionally, however, there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL, which is 15 for a normal-priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slowdown, but other threads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: So I hadn't actually tried passing 31 into my AfxBeginThread call, and it turns out it ignores that value and sets the thread to normal priority instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: Turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. Next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio")
Update: Scheduling heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24) but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter so turned off the weekly scan. Still can't explain some delays so we're going to increase to a 5-10 second watchdog timeout.
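For reference, scheduling the heartbeat thread through MMCSS looks roughly like this (needs avrt.h and linking against Avrt.lib; error handling omitted):
#include <windows.h>
#include <avrt.h>

void heartbeatThread() {
    DWORD taskIndex = 0;
    HANDLE mmcss = AvSetMmThreadCharacteristics(TEXT("Pro Audio"), &taskIndex);
    // ... heartbeat loop runs here at the boosted priority ...
    if (mmcss) AvRevertMmThreadCharacteristics(mmcss);
}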
Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc. It will take some insight to selectively disable devices, or to use alternate drivers, to see whether the system hesitation is affected. Once you find that, then figuring out a way to prevent it might range from trivial to impossible, depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?
Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unofficially you could play with the undocumented NtSetInformationThread,
see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as was said before, you can never be sure that the OS will not take its time when your thread's quantum expires. Certain poorly written drivers are often the cause of such latency.
Otherwise, there is software which can tell you if you have misbehaving kernel parts:
http://www.thesycon.de/deu/latency_check.shtml
I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.
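A sketch of that approach for the 200 ms heartbeat (the period is illustrative and sendHeartbeat is a hypothetical placeholder for the keep-alive call):
#include <windows.h>

void sendHeartbeat() { /* hypothetical keep-alive call to the hardware */ }

void heartbeatLoop() {
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);   // auto-reset timer
    LARGE_INTEGER due;
    due.QuadPart = -2000000LL;            // first fire in 200 ms (100 ns units, negative = relative)
    SetWaitableTimer(timer, &due, 200 /* period, ms */, NULL, NULL, FALSE);
    for (;;) {
        WaitForSingleObject(timer, INFINITE);   // wakes roughly every 200 ms
        sendHeartbeat();
    }
}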

Priority of kernel modules and SCHED_RR threads

I have an embedded Linux platform (the Beagleboard, running Angstrom Linux) with two devices connected:
a Laser range finder (Hokuyo UTM 30) connected via USB
a custom external board connected via SPI
We have a written a Linux kernel module which is responsible for the SPI data transfer. It has an IRQ handler in which spi_async is called which in turn causes an async callback method to be called.
My C++ application consists of three threads:
a main thread for data processing
a laser polling thread
an SPI polling thread
I am experiencing problems which seem to be caused by how the modules described above interact.
When I switch off the USB device (laser range finder) I receive all SPI messages correctly (1 message every 3ms; message length divided by data rate is <1ms), independent of thread scheduling.
When I switch on the USB device and run my program with normal thread scheduling (SCHED_OTHER, priority 0, no nice level set), about 1% of the messages are "lost" because the callback method of spi_async is still running when the next IRQ occurs (I could handle this case differently in order not to lose the messages, so this is not a big issue).
With the USB device turned on and I run the program with SCHED_RR and
priority = 10 for main thread
priority = 10 for SPI reading thread
priority = 4 for USB/Laser polling thread
then I am losing 40% of the messages because the IRQ is triggered again before the spi-callback method is called! (I could still maybe find a workaround, but the problem is that I need fast response times which can no longer be reached in this case.) I need to use the thread scheduling and the laser device, so I am looking for a way to solve this case.
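(For reference, priorities like these would presumably be set along the lines of the following sketch; the helper name is illustrative.)
#include <pthread.h>

// call from within the thread itself; requires root or CAP_SYS_NICE
void makeThreadRealtime(int prio) {
    struct sched_param sp = {};
    sp.sched_priority = prio;   // e.g. 10 for the main and SPI threads, 4 for the laser thread
    pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
}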
Question 1:
My assumption was that IRQ handlers and the callbacks triggered by spi_async in kernel space have higher priority than any thread running in user space (no matter if SCHED_RR or SCHED_OTHER). This would mean that turning to SCHED_RR in my application shouldn't slow down SPI transfer, but this seems very wrong. Is it?
Question 2:
How can I determine what happens here? Which debugging aids exist? (Or maybe you don't need any further information?) The main question for me is: why do I experience the problems only when the laser device is turned on. Could the USB driver consume so much time?
----- EDIT:
I have made the following observation:
The spi_async callback calls wake_up_interruptible(&mydata->readq); (with wait_queue_head_t readq;). From user space (my app) I call a function which results in poll_wait(file, &mydata->readq, wait); and when the poll returns, user space calls read().
When my application runs with SCHED_OTHER I can see that the callback method finishes before the read() method in my kernel module is entered.
When my application runs with SCHED_RR, read() is entered before the callback has exited.
This seems to prove that the priority of the user space threads is higher than the priority of the callback method's context. Is there any way to change this behaviour and still have SCHED_RR for my application's threads?
Not all kernel threads have an RT priority. Imagine a thread that periodically wakes up to do some background work. You don't want this thread to preempt your RT thread. So I guess your first assumption is wrong.
Based on your other questions:
your main processing loop receives SPI data through a queue
the spi processing thread feeds the main processing queue
It seems your main processing thread gets in the way of the SPI driver thread responsible for the SPI data transfer.
Here is what happens:
an IRQ is fired
spi_async is called, which means a data transfer is queued that will be picked up by a thread created by the SPI master driver.
The SPI master thread competes with your main processing thread and the laser thread, but this kernel thread has no RT priority, so it loses every time one of the RR threads is running.
What you can do is go back to normal scheduling while playing with the various CONFIG_PREEMPT_ options. Or mess with the SPI master driver to ensure that any delayed work is queued with enough priority, or not queued at all.