Kworker threads getting blocked by SCHED_RR userspace threads - c++

We have a Linux system using kernel 3.14.17, PREEMPT RT. It is a single core system.
For latency reasons, our application runs some of its threads with scheduling policy SCHED_RR. However, this causes the kworkers in the kernel to be starved of CPU, as they run only under SCHED_OTHER. This can result in a kind of priority inversion: a low-priority SCHED_RR thread can block a higher-priority SCHED_RR thread from receiving data from the driver.
It is the TTY driver that is being blocked: it defers work to a workqueue in tty_flip_buffer_push(). There are possibly more call paths involved, but that is the one we have identified.
Is there any easy way to fix this problem of an RT application depending on a kworker? We are hoping we don't have to patch the driver/kernel ourselves. Are there any kernel config options in the RT kernel for this kind of thing? Can we,
set a SCHED_RR priority for the kworkers?
disable work queues for a specific driver?
If we had to hack the driver, we would probably give it its own workqueue served by a SCHED_RR kworker.
Of course, any other solution is also of interest. We can upgrade to a later kernel version if there is some new feature.

The root cause of this behaviour is tty_flip_buffer_push().
In kernel/drivers/tty/tty_buffer.c:518,
tty_flip_buffer_push() schedules an asynchronous task, which is soon afterwards executed by a kworker thread.
However, if realtime threads keep the system busy, the chance that the kworker thread will execute soon is very low. Only once the RT threads relinquish the CPU, or RT throttling is triggered, does the kworker thread eventually get a chance to execute.
Older kernels support the low_latency flag within the TTY sub-system.
Prior to Linux kernel v3.15 tty_flip_buffer_push() honored the low_latency flag of the tty port.
If the low_latency flag was set by the UART driver as follows (typically in its .startup() function),
t->uport.state->port.tty->low_latency = 1;
then tty_flip_buffer_push() performs a synchronous copy in the context of the current call itself. It thus automatically inherits the priority of the current task, i.e. there is no priority inversion incurred by scheduling a work task asynchronously.
Note: if the serial driver sets the low_latency flag, it must avoid calling tty_flip_buffer_push() from an ISR (interrupt context). With low_latency set, tty_flip_buffer_push() does NOT use a separate workqueue but calls the flush functions directly, so if it is called within an interrupt context the ISR will take longer to execute, increasing the latency of other parts of the kernel/system. Also, under certain conditions (depending on how much data is available in the serial buffer) tty_flip_buffer_push() may attempt to sleep (acquire a mutex), and sleeping within an ISR is a kernel bug.
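For reference, the dispatch logic in pre-v3.15 kernels looked roughly like this (simplified from drivers/tty/tty_buffer.c; treat this as a sketch rather than the exact source):

void tty_flip_buffer_push(struct tty_port *port)
{
        struct tty_bufhead *buf = &port->buf;

        buf->tail->commit = buf->tail->used;

        if (port->low_latency)
                flush_to_ldisc(&buf->work);  /* synchronous, runs in the caller's context */
        else
                schedule_work(&buf->work);   /* deferred to a SCHED_OTHER kworker */
}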
With the workqueue implementation within the Linux kernel having migrated to CMWQ, it is no longer possible to deterministically obtain independent execution contexts (i.e. separate threads) for individual workqueues. All workqueues in the system are backed by the shared pool of kworker/* threads.
NOTE: THIS SECTION IS OBSOLETE!!
Leaving the following intact as a reference for older versions of the Linux kernel.
Customisations for low-latency/real-time UART/TTY:
1. Create and use a personal workqueue for the TTY layer.
Create a new workqueue in tty_init().
A workqueue created with create_workqueue() will have 1 worker thread for each CPU on the system.
struct workqueue_struct *create_workqueue(const char *name);
Using create_singlethread_workqueue() instead creates a workqueue with a single kworker thread:
struct workqueue_struct *create_singlethread_workqueue(const char *name);
2. Use the private workqueue.
Queue the flip-buffer work on the above private workqueue instead of the kernel's global workqueue.
int queue_work(struct workqueue_struct *queue, struct work_struct *work);
Replace schedule_work() with queue_work() in the functions used by tty_flip_buffer_push() (see the sketch after this list).
3. Tweak the execution priority of the private workqueue.
Upon boot, the kworker thread serving the TTY layer's workqueue can be identified by the name string used when creating it. Give it an appropriately high RT priority with chrt, as required by the system design.
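Steps 1 and 2 might look roughly like the following (a minimal sketch; the workqueue name and hook points are illustrative assumptions, not mainline code):

#include <linux/errno.h>
#include <linux/tty.h>
#include <linux/workqueue.h>

/* Hypothetical private workqueue for the TTY flip-buffer work. */
static struct workqueue_struct *tty_flip_wq;

/* Step 1: in tty_init(), create one dedicated kworker instead of
 * relying on the shared pool. */
static int tty_private_wq_init(void)
{
        tty_flip_wq = create_singlethread_workqueue("tty_flip");
        return tty_flip_wq ? 0 : -ENOMEM;
}

/* Step 2: wherever the flip work was queued via schedule_work(&buf->work),
 * queue it on the private workqueue instead. */
static void tty_private_queue_flip(struct tty_bufhead *buf)
{
        queue_work(tty_flip_wq, &buf->work);
}

After boot, the dedicated worker thread should be visible by that name, and step 3's chrt can target its PID, e.g. chrt -r -p 50 <pid>.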

Related

How to create a user space thread? [duplicate]

I have just started coding a device driver and am new to threading; I went through many documents to get an idea about threads, but I still have some doubts:
What is a kernel thread?
How does it differ from a user thread?
What is the relationship between the two?
How can I implement kernel threads?
Where can I see the output of the implementation?
Can anyone help me?
Thanks.
A kernel thread is a task_struct with no userspace components.
Besides the lack of userspace, it has different ancestors (kthreadd kernel thread instead of the init process) and is created by a kernel-only API instead of sequences of clone from fork/exec system calls.
All kernel threads have kthreadd as their parent. Apart from that, kernel threads enjoy the same "independence" from one another as userspace processes.
Use the kthread_run() function/macro from the kthread.h header. You will most probably have to write a kernel module in order to call this function, so you should take a look at the Linux Device Drivers book.
If you are referring to the text output of your implementation (via printk calls), you can see this output in the kernel log using the dmesg command.
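A minimal sketch of a module that starts and stops a kernel thread (names here are illustrative):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
        /* Loop until kthread_stop() is called on us. */
        while (!kthread_should_stop()) {
                pr_info("hello from a kernel thread\n");
                msleep(1000);
        }
        return 0;
}

static int __init demo_init(void)
{
        worker = kthread_run(worker_fn, NULL, "demo_worker");
        return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
        kthread_stop(worker);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

After insmod, the thread shows up in ps as [demo_worker] and its printk output appears in dmesg.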
A kernel thread is a kernel task running only in kernel mode; it usually has not been created by the fork() or clone() system calls. Examples are kworker and kswapd.
You probably should not implement kernel threads if you don't know what they are.
Google gives many pages about kernel threads, e.g. Frey's page.
User threads & stack:
Each thread has its own stack so that it can use its own local variables; threads share global variables, which are part of the .data or .bss sections of a Linux executable.
Because threads share global variables, we use synchronization mechanisms such as a mutex when we want to access/modify a global variable in a multithreaded application. Local variables are part of each thread's individual stack, so no synchronization is needed for them.
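A small illustration of that rule (hypothetical example code):

#include <pthread.h>
#include <stdio.h>

static int counter;                 /* shared: lives in .bss */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
        int i;                      /* local: on this thread's stack, no lock needed */
        for (i = 0; i < 100000; i++) {
                pthread_mutex_lock(&lock);    /* protect the shared global */
                counter++;
                pthread_mutex_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d\n", counter);   /* 200000 with the mutex */
        return 0;
}

Without the mutex, the two threads could lose increments of counter; the loop variable i needs no protection.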
Kernel threads
Kernel threads have emerged from the need to run kernel code in process context. Kernel threads are the basis of the workqueue mechanism. Essentially, a kernel thread is a thread that only runs in kernel mode and has no user address space or other user attributes.
To create a kernel thread, use kthread_create():
#include <linux/kthread.h>
struct task_struct *kthread_create(int (*threadfn)(void *data),
                                   void *data, const char namefmt[], ...);
Kernel threads & stack:
Kernel threads are used to do post-processing tasks for the kernel, like the pdflush threads, workqueue threads, etc.
Kernel threads are basically new processes without an address space (they can be created using the clone() call with the required flags), which means they cannot switch to user space. Kernel threads are schedulable and preemptible like normal processes.
Kernel threads have their own stacks, which they use to manage local state.
More about kernel stacks:-
https://www.kernel.org/doc/Documentation/x86/kernel-stacks
Since you're comparing kernel threads with user[land] threads, I assume you mean something like the following.
The normal way of implementing threads nowadays is to do it in the kernel, so those can be considered "normal" threads. It is, however, also possible to implement them in userland, using signals such as SIGALRM, whose handler saves the current process state (registers, mostly) and switches to another previously saved state. Several OSes used this as a way to implement threads before they got proper kernel thread support. Such threads can be faster, since you don't have to go into kernel mode, but in practice they have faded away.
There's also cooperative userland threads, where one thread runs until it calls a special function (usually called yield), which then switches to another thread in a similar way as with SIGALRM above. The advantage here is that the program is in total control, which can be useful when you have timing concerns (a game for example). You also don't have to care much about thread safety. The big disadvantage is that only one thread can run at a time, and therefore this method is also uncommon now that processors have multiple cores.
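A minimal sketch of the cooperative variant using the (now obsolescent) POSIX ucontext API; names and sizes are illustrative:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

/* A cooperative "thread": runs until it explicitly yields back. */
static void co_fn(void)
{
        puts("coroutine: step 1");
        swapcontext(&co_ctx, &main_ctx);   /* yield to main */
        puts("coroutine: step 2");
}

int main(void)
{
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = co_stack;
        co_ctx.uc_stack.ss_size = sizeof co_stack;
        co_ctx.uc_link = &main_ctx;        /* resume main when co_fn returns */
        makecontext(&co_ctx, co_fn, 0);

        swapcontext(&main_ctx, &co_ctx);   /* run coroutine until first yield */
        puts("main: coroutine yielded");
        swapcontext(&main_ctx, &co_ctx);   /* resume coroutine */
        puts("main: coroutine finished");
        return 0;
}

Only one of the two flows runs at any moment, which is exactly the limitation described above.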
Kernel threads are implemented in the kernel. Perhaps you meant how to use them? The most common way is to call pthread_create.

dispatcher in real time operating system

I am reading about real-time concepts at the following link:
http://www.embeddedlinux.org.cn/RTConforEmbSys/5107final/LiB0024.html
Here in section 4.4.4 it says:
The dispatcher is the part of the scheduler that performs context
switching and changes the flow of execution. At any time an RTOS is
running, the flow of execution, also known as flow of control, is
passing through one of three areas: through an application task,
through an ISR, or through the kernel. When a task or ISR makes a
system call, the flow of control passes to the kernel to execute one
of the system routines provided by the kernel. When it is time to
leave the kernel, the dispatcher is responsible for passing control
to one of the tasks in the user’s application. It will not necessarily
be the same task that made the system call.
It is the scheduling algorithms (to be discussed shortly) of the
scheduler that determines which task executes next. It is the
dispatcher that does the actual work of context switching and passing
execution control.
Depending on how the kernel is first entered, dispatching can happen
differently. When a task makes system calls, the dispatcher is used to
exit the kernel after every system call completes. In this case, the
dispatcher is used on a call-by-call basis so that it can coordinate
task-state transitions that any of the system calls might have
caused. (One or more tasks may have become ready to run, for example.)
On the other hand, if an ISR makes system calls, the dispatcher is
bypassed until the ISR fully completes its execution. This process is
true even if some resources have been freed that would normally
trigger a context switch between tasks. These context switches do not
take place because the ISR must complete without being interrupted by
tasks. After the ISR completes execution, the kernel exits through the
dispatcher so that it can then dispatch the correct task.
My question on the above text is:
What does the author mean by "When a task makes system calls, the dispatcher is used to exit the kernel after every system call completes. In this case, the dispatcher is used on a call-by-call basis so that it can coordinate task-state transitions that any of the system calls might have caused."? Specifically, what does the author mean by "the dispatcher is used to exit the kernel"?
Thanks!
The author is presenting a simplified explanation of a real-time system architecture where a thread of control can be in one of three states - kernel mode (system calls), application mode (TASK), or interrupt service routine (ISR) mode.
The dispatcher in this example is a kernel routine that decides the application TASK that is to receive control after exiting each system call made by one of the application TASKs. This could be the TASK that issued the system call or it could be a different TASK depending on the dispatching algorithms and rules that are being followed.
There are many variations of dispatching rules and algorithms based on the planned usage of the system; As an example you might think of giving each TASK an equal amount of CPU time per minute - so if 3 application TASKS are being executed each one is supposed to receive 20 seconds of CPU time every minute. The dispatcher in this case could decide the next TASK to receive control is the TASK with the smallest accumulated CPU time during the last minute in an attempt to equally distribute the CPU time per minute across the TASKS.
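In code, that example rule might look something like this (a hypothetical illustration, not from any real RTOS):

/* Pick the ready TASK with the least CPU time accumulated over the
 * last minute, per the fairness rule described above. */
struct task {
        const char *name;
        unsigned cpu_ms_last_minute;   /* CPU time used in the last 60 s */
        int ready;                     /* eligible to run? */
};

static struct task *pick_next(struct task *tasks, int n)
{
        struct task *best = NULL;
        for (int i = 0; i < n; i++) {
                if (!tasks[i].ready)
                        continue;
                if (best == NULL ||
                    tasks[i].cpu_ms_last_minute < best->cpu_ms_last_minute)
                        best = &tasks[i];
        }
        return best;   /* NULL if nothing is ready */
}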
After deciding which TASK is to receive control next, the dispatcher exits the system-call mode and transfers control to that application TASK; invoking the dispatcher is thus the equivalent of "exiting" the kernel and handing the CPU to an eligible application TASK.
The author states that this is a real-time system, which means emphasis is given to the quick processing of interrupts (via ISRs) over the processing of applications (TASKS). To minimize the time consumed by each interrupt, when an ISR issues a system call the kernel returns directly to that ISR and does not "exit via the dispatcher", which would allow control to be given to an application TASK.
When the ISR has completed its processing, its exit will be performed in a manner that causes the kernel to invoke the dispatcher (hence it will exit via the dispatcher) so an application TASK can once again use the CPU.
As an additional note: one of the hidden assumptions in this explanation is that the same kernel routines (system calls) can be invoked by application TASKS and interrupt service routines (ISR). This is very common but security and performance issues might require different sets of kernel routines (system calls) for ISRs and TASKS.
After the system call finishes executing, control has to be passed back to a user-space task. There are likely many user-space tasks waiting to run, and they all might have different priorities. The dispatcher uses its algorithm to evaluate the waiting tasks based on priority and other criteria (how long have they been waiting? how long do I anticipate they will need?) and then starts one of them.
For example, you might have an application that needs to read input from the command line, so it calls the read() system call, which passes control to the kernel. After the read() completes, the dispatcher evaluates the tasks waiting to run and may decide to run a task other than the one that called read().

How to reduce CUDA synchronize latency / delay

This question is related to using cuda streams to run many kernels
In CUDA there are many synchronization commands:
cudaStreamSynchronize,
cudaDeviceSynchronize,
cudaThreadSynchronize,
and also cudaStreamQuery to check if streams are empty.
I noticed when using the profiler that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency, apart from, of course, using as few synchronization commands as possible.
Also, are there any figures for judging the most efficient synchronization method? That is, consider 3 streams used in an application, two of which need to complete before I launch a fourth stream: should I use 2 cudaStreamSynchronize calls or just one cudaDeviceSynchronize? Which will incur less loss?
The main difference between synchronize methods is "polling" and "blocking."
"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return the wait more quickly after the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.
"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.
The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).

How can I avoid preemption of my thread in user mode

I have a simple chunk of deterministic work that only takes thirteen machine instructions to complete. Because the first instruction takes a homemade semaphore (spinlock) and the last instruction releases it, I am safe from all of the other threads running on the other cores as they are attempting to take and give the same semaphore.
The problem arises when something interrupts a thread holding the semaphore before it can finish its "critical section". In the worst case, the interruption kills the thread while it holds the semaphore, or, as can happen, one of the threads normally competing for the semaphore branches into code that can generate the interruption, causing a deadlock.
I don't have a way of synchronizing with these other threads when they branch into the parts of the code I can't control. I think I need to disable interrupts, like I used to in my old VxWorks days when I was running in kernel mode. It's always thirteen instructions, and I am always completely safe if I can get all thirteen done before I have to honor an interrupt. Oh, and it is all my own internal data; other than the homemade semaphore, nothing locks anything else up.
I have read several answers that I think are close. Most have to do with Critical Section calls in the Windows API (wrong OS, but maybe the right concept). Most of the wrong solutions assume that I can get all of the offending threads to use a mutex created with the pthread libraries.
I need this solution in C/C++ on Linux and Solaris.
Johnny Crash's question is very close
prevent linux thread from being interrupted by scheduler
KermitG also
Can I prevent a Linux user space pthread yielding in critical code?
Thanks for your consideration.
You cannot prevent preemption of a user-mode thread. Critical sections (and all other sync objects) prevent collisions between your threads; however, they by no means prevent the OS from preempting them.
If your other threads branch into something on a timeout, where that something may lead to a deadlock, you have a design problem.
A correct design should be the most pessimistic: preemption may occur everywhere for indeterminate time.
Yes, yes - 7 years old - I need to do exactly this but for other reasons.
So I put this here for others to read in a historical context.
I am writing an emulation layer for an embedded RTOS in which I need to emulate the embedded platform's CPU_irq_disable() and CPU_irq_restore(). The closest thing I can think of is to disable preemption in the scheduler.
Yes, the target sometimes has an RTOS - and sometimes it does not.
IO is emulated via sockets, i.e. a serial port behaves like a stream socket!
A GPIO pin (edge IRQ) can be a socket too. The current value lives in a quasi-global in the driver, and waiting for a pin change = waiting for a packet to arrive on a socket.
So the socket read thread acts like an IRQ when a packet shows up.
Thus, to emulate irq disable, it is reasonable to disable preemption within my own application.
Also at the embedded application layer, I need to emulate what would be a superloop.
No amount of mutex stuff is going to emulate the embedded platform reasonably.
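Linux offers no true userspace preempt-disable, but one rough approximation under the stated assumptions (a single-CPU emulation where all the "ISR" threads run at lower SCHED_FIFO priorities, and the process has CAP_SYS_NICE) is to boost to the top RT priority for the duration; names here are illustrative:

#include <sched.h>

static int old_policy;
static struct sched_param old_param;

/* Approximate CPU_irq_disable(): nothing of lower RT priority on this
 * CPU can preempt us afterwards. Hardware interrupts and higher-priority
 * kernel work still can - this is only an emulation aid. */
static void emu_irq_disable(void)
{
        struct sched_param p = {
                .sched_priority = sched_get_priority_max(SCHED_FIFO)
        };
        old_policy = sched_getscheduler(0);
        sched_getparam(0, &old_param);
        sched_setscheduler(0, SCHED_FIFO, &p);
}

/* Approximate CPU_irq_restore(): drop back to the previous policy. */
static void emu_irq_restore(void)
{
        sched_setscheduler(0, old_policy, &old_param);
}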

Pattern for realizing low priority background threads?

I have a (soft) realtime system which queries some sensor data, does some processing and then waits for the next set of sensor data. The sensor data are read in a receiver thread and put into a queue, so the main thread is "sleeping" (by means of a mutex) until the new data has arrived.
There are other tasks like logging or some long-term calculations in the background to do. These are implemented to run in other threads.
However, it is important that while the main thread processes the sensor data, it has the highest priority, which means the other threads should consume no CPU resources at all if possible (currently the background threads slow down the main thread in an unacceptable way).
According to Setting thread priority in Linux with Boost, there is doubt that setting thread priorities will do the job. I am wondering how I can measure what effect setting thread priorities really has. (Platform: Angstrom Linux, ARM PC)
Is there a way to "pause" and "continue" threads completely?
Is there a pattern in C++ for realizing the pause/continue on my own? (I might be able to split the background work into small chunks and check after every chunk whether I am allowed to continue, but the question is how big these chunks should be, etc.)
Thanks for your thoughts!
Your problem is with the OS scheduler, not with C++. You need a true real-time scheduler that will block lower-priority threads while a higher-priority thread is running.
Most "standard" PC schedulers are not real-time. There is an RT scheduler for Linux - use it. Start by reading about SCHED_RR and SCHED_FIFO, and the nice command.
On many systems you'll have to spawn a task (using fork) to ensure the nice levels and the RT scheduler are actually effective; you have to read through your system's manuals and figure out which scheduling modules you have and how they are implemented.
There is no portable way to set the priority in Boost::Thread. The reason is that different OSs have different APIs for setting the priority (e.g. Windows and Linux).
The best way to set the priority portably is to write a wrapper around boost::thread with a uniform API that internally gets the thread's native_handle and then uses the OS-specific API (for example, on Linux you can use sched_setscheduler()).
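On Linux, where native_handle() yields a pthread_t, the OS-specific part of such a wrapper might look like this (a sketch using plain pthreads; the function name is illustrative):

#include <pthread.h>
#include <sched.h>

/* Give a thread an RT priority under SCHED_FIFO.
 * 'handle' is what boost::thread::native_handle() returns on Linux. */
static int set_rt_priority(pthread_t handle, int prio)
{
        struct sched_param p = { .sched_priority = prio };
        return pthread_setschedparam(handle, SCHED_FIFO, &p);
}

The background threads can be left as SCHED_OTHER, or set to SCHED_IDLE, so the main thread always wins the CPU.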
You can see an example here:
https://sourceforge.net/projects/threadutility/
(code made by a student of mine, look at the svn repository)