linux process scheduling delayed for long time

I ran strace on my multi-threaded C++ application running on Linux.
After a couple of hours of running, none of the threads ran for about 12 seconds.
strace showed a select system call, made with a timeout, as unfinished before the thread was suspended; once the thread resumed, it reported that the call took 11.x seconds to finish (the timeout is only 900 ms).
This is a clear indication that the process was starved for a long time.
All threads in the process are created with Linux's default scheduling policy (SCHED_OTHER) and default priority.
There are another 5 similar apps running on the same box, which are also heavily I/O bound like this app because of the large volume of data received on the socket. But most of the time it is this app that suffers the scheduling delay. The other apps are created with the same scheduling policy and priority as this one, i.e. the defaults. Why does only this process get blocked almost every time?
Could it be because this process is more I/O intensive, i.e. busier, perhaps due to higher data rates? Is Linux's dynamic priority adjustment in play here, pushing this process down?

Priority and process scheduling in Linux are related only to CPU time. In fact, the process scheduler only cares about processes that are waiting to run on the CPU. Processes/threads that are waiting for I/O are not handled by the process scheduler but by the I/O scheduler.
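
To see such stalls from inside the application instead of through strace, one option is to time the select call itself. The following is only a minimal sketch: the function names timed_select_ms and select_overshoot_ms are made up for illustration, and it uses select purely as a timer (no file descriptors), which is a simplification of the socket-watching call described in the question.

```cpp
#include <sys/select.h>
#include <chrono>

// Hypothetical helper: calls select() with no file descriptors and the given
// timeout, and returns how long the call really took, in milliseconds,
// measured on a monotonic clock.
long long timed_select_ms(long timeout_ms) {
    timeval tv{};
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    const auto start = std::chrono::steady_clock::now();
    select(0, nullptr, nullptr, nullptr, &tv);  // pure timeout, nothing to watch
    const auto end = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

// Hypothetical helper: how far the call ran past its requested timeout.
// A value in the thousands would correspond to the multi-second stall
// described in the question.
long long select_overshoot_ms(long timeout_ms) {
    return timed_select_ms(timeout_ms) - timeout_ms;
}
```

Logging the overshoot together with a timestamp whenever it exceeds some threshold would make it possible to line the stalls up with other activity on the box. The sketch does not say why the thread was not running, only that it was not.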

Related

SCHED_FIFO thread freezes terminal

I have a CentOS minimal hexacore 3.5 GHz machine and I do not understand why a SCHED_FIFO realtime thread pinned to only 1 core freezes the terminal. How can I avoid this while keeping the realtime behaviour of the thread, without using sleep in the loop or blocking it? To simplify my problem: this thread tries to dequeue items from a non-blocking, lock-free, concurrent queue in an infinite loop.
The kernel runs on core 0; all the other cores are free. All other threads, and my process too, are SCHED_OTHER with the same priority, 20. This is the only thread where I need ultra-low latency for some high-frequency calculations. After starting the application everything seems to work, but my terminal freezes (I connect remotely through ssh). I am able to see the threads created and force-close my app from htop. The RT thread seems to run its assigned core at 100%, as expected. When I kill the app, the frozen terminal is released and I can use it again.
It looks like that thread has higher priority than everything else across all cores, but I want this only on the core I pinned it to.
Thank you

Hi victor, you need to isolate the core from the Linux scheduler so that it does not try to assign lower-priority tasks, such as running your terminal, to a core that is running SCHED_* jobs with higher priority. In your case you can isolate core 1 by adding the kernel option isolcpus=1 to your grub.cfg (or whatever boot loader config you are using).
After rebooting, you can confirm that you have successfully isolated core 1 by running dmesg | grep isol and checking that your kernel was booted with the option.
Here is some more info on isolcpus:
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
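
The same check can be done from within the application before it starts the realtime thread. The sketch below is an assumption-laden illustration rather than part of the answer above: it reads /proc/cmdline instead of dmesg, the helper names isolcpus_value and read_kernel_cmdline are invented, and the value is returned as raw text (for example "1" or "1-3") without being parsed into a CPU list.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the text after "isolcpus=" in a kernel
// command line, or an empty string if the option is not present.
std::string isolcpus_value(const std::string& cmdline) {
    const std::string key = "isolcpus=";
    std::istringstream tokens(cmdline);
    std::string token;
    while (tokens >> token) {  // options are separated by whitespace
        if (token.compare(0, key.size(), key) == 0) {
            return token.substr(key.size());
        }
    }
    return "";
}

// Hypothetical helper: reads the command line the running kernel was booted
// with (Linux only). Returns an empty string if the file cannot be read.
std::string read_kernel_cmdline() {
    std::ifstream in("/proc/cmdline");
    std::string line;
    std::getline(in, line);
    return line;
}
```

Calling isolcpus_value(read_kernel_cmdline()) at startup and refusing to create the SCHED_FIFO thread when the result is empty would catch a machine that was booted without the option.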

Detecting and recovering from Windows TDR?

I've run into an odd issue with some OpenCL code that I'm working on where every once in a blue moon, Windows TDR will kick in and reset the GPU. The offending kernel runs for only 150ms and will run thousands of times (over the course of many hours) before the TDR kills it off, so I'm certain that the kernel itself isn't to blame.
My concern is that once the TDR kicks in, the kernel dies and the program is stuck in an eternal state of limbo. From what I can tell the call to clFinish never returns.
Is there a way to detect if a kernel has died off so that it can be handled gracefully?

I managed to come up with a solution, although it's far from optimal.
I've modified the program so that the OpenCL processing is done in a separate thread. I created a global watchdog variable shared between the parent and the processing thread. When the parent spawns the processing function as a thread, it sets the variable to the current time in milliseconds. When the processing thread finishes, it resets the watchdog variable to zero.
While the parent thread waits for the processing thread to finish, it keeps an eye on the watchdog timer. If the timer exceeds a certain threshold, the program forcefully terminates itself without waiting for the processing thread to return.
This solution works with or without Windows TDR set. If TDR is set and the driver resets, the call to clFinish() will never return and the parent will terminate once the watchdog timer trips. If TDR is not set, the runaway process will freeze the display, but once the watchdog timer trips, the parent will terminate processing, ending the freeze.
Now that I have a watchdog set up, I simply wrapped my program in a script: if it terminated in error (positive return code) then the program is rerun.
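
A rough sketch of that watchdog in portable C++ might look as follows. It is not the original program: the names now_ms and run_with_watchdog are invented, the OpenCL work is stood in for by an arbitrary callable, and instead of terminating the process the function just returns false when the watchdog trips, leaving it to the caller to exit with a positive return code.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <memory>
#include <thread>

// Milliseconds on a monotonic clock (assumed to be non-zero in practice,
// because zero is reserved below to mean "finished").
static std::int64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

// Hypothetical helper: runs `work` on a separate thread while the calling
// thread watches a shared start-time variable. Returns true if the work
// finished within `timeout_ms`, false if the watchdog tripped. On a trip the
// worker is detached and left behind; the caller is expected to exit.
bool run_with_watchdog(std::function<void()> work, std::int64_t timeout_ms) {
    // shared_ptr keeps the variable alive even if the worker outlives this call.
    auto started_at = std::make_shared<std::atomic<std::int64_t>>(now_ms());

    std::thread worker([work, started_at] {
        work();
        started_at->store(0);  // reset to zero: processing finished
    });

    while (true) {
        const std::int64_t t = started_at->load();
        if (t == 0) {
            worker.join();
            return true;
        }
        if (now_ms() - t > timeout_ms) {
            worker.detach();  // never wait for a call that may not return
            return false;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}
```

In a setup like the one described, main would call run_with_watchdog around the processing function and return a positive code when it yields false, so that the wrapper script reruns the program.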

Ideally, you should get an error code from clFinish or clWaitForEvents with the OpenCL event object generated when enqueuing the kernel. Since TDR resets the graphics driver, I don't think any OpenCL implementation will work reliably, meaning there is no recovery route.
Rather disable TDR completely. It is only worthwhile when you debug code that gets stuck in an infinite loop that permanently keeps the GPU busy.
If you want to keep TDR but can change the code then using some sort of thread sleep function to delay your code for a few milliseconds could also alleviate this problem, at the expense of sacrificing processing speed. This gives the graphics card a chance to respond to display rendering commands so that TDR is not triggered.

QProcess ProcessState sufficient for Blocked Processes?

I want to know if a process (started with the QProcess class) doesn't respond anymore. For instance, my process is an application that only prints 1 every second.
My problem is that I want to know if (for some mystical reason) that process is blocked for a short period of time (more than 1 second, something noticeable by a human).
However, the different states of a QProcess (Not Running, Starting, Running) don't include a "Blocked" state.
By blocked I mean "doesn't answer to the OS", as when we get the "Not Responding" message in the Task Manager, such as when a Windows MMI (like explorer.exe) is blocked and becomes white.
But I want to detect that "Not Responding" state for ANY process, not just an MMI.
Is there a way to detect such a state?

Qt doesn't provide any API for that. You'd need to use platform-specific mechanisms. On some platforms (Windows!), there is no notion of a hung application, merely that of a hung window. You can have one application that has both responsive and unresponsive windows :)
On Windows, you'd enumerate all windows using EnumWindows, check whether they belong to your process by comparing the pid from GetWindowThreadProcessId to process->pid(), and finally check whether the window is hung through IsHungAppWindow.
Caveats
Generally, there is no such thing as an all-encompassing notion of a "non-responding" process.
Suppose you have a web server. What does it mean that it's not responding? It's under heavy load, so it may deny some incoming connections. Is that "non responding" from your perspective? It may be, but there's nothing you can do about it - killing and restarting the process won't fix it. If anything, it will make things worse for the already connected clients.
Suppose you have a process that is blocking on a filesystem read because the particular drive it tries to access is slow, or under heavy load. Does it mean that it's not responding? Will killing and restarting it always fix this? If the process then retries the read from the beginning of the file, it may well make things worse.
Suppose you have a poorly designed process with a GUI. It's doing blocking serial port reads in the GUI thread. The read it's doing takes a long time, and the GUI is nonresponsive for several seconds. You kill the process, it restarts and tries that long read again: you've only made things worse.
You have to tread very carefully here.
Solution Ideas
There are multiple approaches to determining what is a "responsive" process. It was already mentioned that processes with a GUI are monitored by the operating system on both Windows and OS X. Thus one can use native APIs that can query whether a window or a process is hung or not. This makes sense for applications that offer a UI, and subject to caveats above.
If the process is providing a service, you may periodically use the service to determine if it's still available, subject to some deadlines. Any decision about what to do with a "hung" process should take into account the CPU and I/O load of the system.
It may be worthwhile to keep a history of the latency of the service's response to the service request. Only "large" changes to the latency should be taken as an indication of a problem. Suppose you're keeping track of the average latency. One could set an ultimate deadline at 50x the previous average latency. If this deadline is missed, the service is presumed dead and up for forced recycling. An "action flag" deadline may be set at 5-10x the average latency. A human would then be given the option of an orderly restart of the service. The flag would be automatically removed when the latency drops back to, say, 30% below the deadline that triggered the flag.
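
The thresholds above can be written down as a small classification function. This is only one possible reading of the paragraph: the names Health, LatencyPolicy and classify are invented, the flag factor is fixed at 5x (the low end of the 5-10x range), and the average is assumed to be computed elsewhere from samples taken before the current one.

```cpp
// Hypothetical states for a monitored service.
enum class Health { Ok, Flagged, Dead };

// Hypothetical policy; the numbers follow the paragraph above.
struct LatencyPolicy {
    double flag_factor = 5.0;   // "action flag" deadline: 5x the average (5-10x suggested)
    double dead_factor = 50.0;  // ultimate deadline: 50x the average
    double clear_ratio = 0.7;   // flag clears at 30% below the flag deadline
};

// Classifies one latency sample against the previous average latency.
// `previous` is the state before this sample; it matters only for the
// hysteresis that keeps a flag raised until latency has clearly dropped.
Health classify(double avg_latency, double latency, Health previous,
                const LatencyPolicy& p = LatencyPolicy{}) {
    const double flag_deadline = p.flag_factor * avg_latency;
    const double dead_deadline = p.dead_factor * avg_latency;

    if (latency >= dead_deadline) return Health::Dead;     // presumed dead
    if (latency >= flag_deadline) return Health::Flagged;  // offer a restart
    if (previous == Health::Flagged && latency > p.clear_ratio * flag_deadline) {
        return Health::Flagged;  // below the deadline, but not yet 30% below it
    }
    return Health::Ok;
}
```

With an average of 10 ms, for instance, a 60 ms response raises the flag, a following 40 ms response keeps it raised (the clearing point is 35 ms), and a 30 ms response clears it.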
If you are the developer of the monitored process, then you can invert the monitoring aspect and become a passive watchdog of the monitored process. The monitored process must then periodically and actively "wake" the watchdog to indicate that it's alive. The wake signal (in generic terms) should be emitted at strategic location(s) in the code. Periodic reception of wake "signals" should allow you to reason that the process is still alive. You may have multiple wake signals, tagged with their location in the watched process. Everything depends on how many threads the process has, what it is doing, etc.
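
The bookkeeping side of such a passive watchdog could be sketched like this. The class name PassiveWatchdog and its methods are made up, and how the wake signal travels from the monitored process to the watchdog (pipe, socket, shared memory, and so on) is deliberately left out; the sketch only records when each tagged location was last heard from.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical passive watchdog: remembers the last wake time per location.
class PassiveWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    explicit PassiveWatchdog(Clock::duration max_silence)
        : max_silence_(max_silence) {}

    // Record a wake signal tagged with the place in the code that sent it.
    void wake(const std::string& location, Clock::time_point now = Clock::now()) {
        last_wake_[location] = now;
    }

    // True if at least one location has been heard from recently enough.
    bool alive(Clock::time_point now = Clock::now()) const {
        for (const auto& entry : last_wake_) {
            if (now - entry.second <= max_silence_) return true;
        }
        return false;
    }

    // True if this particular location is unknown or has been silent too long.
    bool stale(const std::string& location, Clock::time_point now = Clock::now()) const {
        const auto it = last_wake_.find(location);
        return it == last_wake_.end() || now - it->second > max_silence_;
    }

private:
    Clock::duration max_silence_;
    std::map<std::string, Clock::time_point> last_wake_;
};
```

Whether one silent location should count as a hang, or only silence from all of them, depends on what the threads are doing, which is why alive and stale are kept as separate queries here.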

Can I set a single thread's priority above 15 for a normal priority process?

I have a data acquisition application running on Windows 7, using VC2010 in C++. One thread is a heartbeat which sends out a change every .2 seconds to keep-alive some hardware which has a timeout of about .9 seconds. Typically the heartbeat call takes 10-20ms and the thread spends the rest of the time sleeping.
Occasionally however there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL which is 15 for a normal priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slowdown, but other threads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: So I hadn't actually tried passing 31 into my AfxBeginThread call and turns out it ignores that value and sets the thread to normal priority instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: Turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. Next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio")
Update: Scheduling heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24) but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter so turned off the weekly scan. Still can't explain some delays so we're going to increase to a 5-10 second watchdog timeout.

Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc., etc. It will take some insight to selectively disable or use an alternate driver to see if the problem system hesitation is affected. Once you find that, then figuring out a way to prevent it might range from trivial to impossible depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?

Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unofficially, you could play with the undocumented NtSetInformationThread; see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as was said before, you can never be sure that the OS will not take its time when your thread's quantum expires. Poorly written drivers are often the cause of such latency.
Otherwise, there is software that can tell you if you have misbehaving kernel parts:
http://www.thesycon.de/deu/latency_check.shtml

I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.

Priority of kernel modules and SCHED_RR threads

I have an embedded Linux platform (the Beagleboard, running Angstrom Linux) with two devices connected:
a Laser range finder (Hokuyo UTM 30) connected via USB
a custom external board connected via SPI
We have written a Linux kernel module which is responsible for the SPI data transfer. It has an IRQ handler in which spi_async is called, which in turn causes an async callback method to be called.
My C++ application consists of three threads:
a main thread for data processing
a laser polling thread
an SPI polling thread
I am experiencing problems which seem to be caused by how the modules described above interact.
When I switch off the USB device (laser range finder) I receive all SPI messages correctly (1 message every 3 ms, message length divided by data rate is <1 ms), independent of thread scheduling.
When I switch on the USB device and run my program with normal thread scheduling (SCHED_OTHER, priority 0, no nice level set), about 1% of the messages are "lost" because the callback method of spi_async is still running when the next IRQ occurs. (I could handle this case differently in order not to lose the messages, so this is not a big issue.)
When the USB device is turned on and I run the program with SCHED_RR and
priority = 10 for main thread
priority = 10 for SPI reading thread
priority = 4 for USB/Laser polling thread
then I am losing 40% of the messages because the IRQ is triggered again before the SPI callback method is called! (I could maybe still find a workaround, but the problem is that I need fast response times, which can no longer be reached in this case.) I need to use the thread scheduling and the laser device, so I am looking for a way to solve this case.
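
For reference, priorities like the ones listed above are typically applied per thread through the POSIX scheduling calls. The following is a sketch, not the poster's code: the helper name set_current_thread_rr is invented, the requested priority is clamped to whatever range the system reports for SCHED_RR, and the call is expected to fail with EPERM when the process lacks the privileges for realtime scheduling.

```cpp
#include <pthread.h>
#include <sched.h>
#include <algorithm>
#include <cerrno>

// Hypothetical helper: tries to switch the calling thread to SCHED_RR at the
// given priority, clamped to the range reported for that policy. Returns 0 on
// success or an error number (typically EPERM without sufficient privileges).
int set_current_thread_rr(int priority) {
    const int lowest = sched_get_priority_min(SCHED_RR);
    const int highest = sched_get_priority_max(SCHED_RR);

    sched_param param{};
    param.sched_priority = std::min(std::max(priority, lowest), highest);

    // pthread_setschedparam returns the error number directly, not via errno.
    return pthread_setschedparam(pthread_self(), SCHED_RR, &param);
}
```

Each of the three threads would call this at startup with its own value (10, 10 and 4 in the configuration above) and check the return code, since a silently failed call leaves the thread on SCHED_OTHER.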
Question 1:
My assumption was that IRQ handlers and the callbacks triggered by spi_async in kernel space have higher priority than any thread running in user space (no matter if SCHED_RR or SCHED_OTHER). This would mean that turning to SCHED_RR in my application shouldn't slow down SPI transfer, but this seems very wrong. Is it?
Question 2:
How can I determine what happens here? Which debugging aids exist? (Or maybe you don't need any further information?) The main question for me is: why do I experience the problems only when the laser device is turned on. Could the USB driver consume so much time?
----- EDIT:
I have made the following observation:
The spi_async's callback calls wake_up_interruptible(&mydata->readq); (with wait_queue_head_t readq;). From the user space (my app) I call a function which results in poll_wait(file, &mydata->readq, wait); When the poll returns the user space calls read().
When my application runs with SCHED_OTHER I can see that the callback method first finishes before the read() method in my kernel module is entered.
When my application runs with SCHED_RR read is entered before exiting the callback.
This seems to prove that the priority of the user space threads is higher than the priority of the callback method's context. Is there any way to change this behaviour and still have SCHED_RR for my application's threads?

Not all kernel threads have an RT priority. Imagine a thread that wakes up periodically to do some background work. You don't want this thread to preempt your RT thread. So I guess your first assumption is wrong.
Based on your other questions :
your main processing loop receives SPI data through a queue
the spi processing thread feeds the main processing queue
It seems your main processing thread gets in the way of the SPI driver thread responsible for the SPI data transfer.
Here is what happens :
an IRQ is fired
spi_async is called, which means a data transfer is queued, to be picked up by a thread created by the SPI master driver.
the SPI master thread competes with your main processing thread and the laser thread, but this kernel thread has no RT priority, so it loses whenever one of the RR threads is running.
What you can do is go back to normal scheduling while playing with the various CONFIG_PREEMPT_ options. Or modify the SPI master driver to ensure that any delayed work is queued with enough priority, or even not queued at all.
