Priority of kernel modules and SCHED_RR threads - c++

I have an embedded Linux platform (the Beagleboard, running Angstrom Linux) with two devices connected:
a Laser range finder (Hokuyo UTM 30) connected via USB
a custom external board connected via SPI
We have written a Linux kernel module that is responsible for the SPI data transfer. It has an IRQ handler that calls spi_async, which in turn causes an asynchronous callback method to be invoked.
My C++ application consists of three threads:
a main thread for data processing
a laser polling thread
an SPI polling thread
I am experiencing problems which seem to be caused by how the modules described above interact.
When I switch off the USB device (laser range finder) I receive all SPI messages correctly (1 message every 3 ms; message length divided by data rate is <1 ms), independent of thread scheduling.
When I switch on the USB device and run my program with normal thread scheduling (SCHED_OTHER, priority 0, no nice level set), about 1% of the messages are "lost" because the callback method of spi_async is still running when the next IRQ occurs. (I could handle this case differently in order not to lose the messages, so this is not a big issue.)
When the USB device is turned on and I run the program with SCHED_RR and
priority = 10 for main thread
priority = 10 for SPI reading thread
priority = 4 for USB/Laser polling thread
then I am losing 40% of the messages because the IRQ is triggered again before the spi_async callback has even been called! (I could still find a workaround, but the real problem is that I need fast response times, which can no longer be achieved in this case.) I need both the real-time scheduling and the laser device, so I am looking for a way to solve this. The scheduling setup is sketched below.
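For reference, a minimal sketch of how threads are typically given SCHED_RR priorities with pthreads; the thread bodies and priority values below are placeholders mirroring the list above, not the actual application code:

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Hypothetical thread bodies standing in for the real ones.
    void* main_processing(void*) { return nullptr; }
    void* spi_polling(void*)     { return nullptr; }
    void* laser_polling(void*)   { return nullptr; }

    // Create a thread with SCHED_RR and the given static priority.
    static pthread_t spawn_rr(void* (*fn)(void*), int priority)
    {
        pthread_attr_t attr;
        sched_param    param{};

        pthread_attr_init(&attr);
        pthread_attr_setschedpolicy(&attr, SCHED_RR);
        param.sched_priority = priority;
        pthread_attr_setschedparam(&attr, &param);
        // Without this flag the attributes above are silently ignored
        // and the new thread inherits the creator's policy instead.
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);

        pthread_t tid;
        if (pthread_create(&tid, &attr, fn, nullptr) != 0)
            perror("pthread_create");   // typically EPERM without CAP_SYS_NICE
        pthread_attr_destroy(&attr);
        return tid;
    }

    int main()
    {
        pthread_t t1 = spawn_rr(main_processing, 10);
        pthread_t t2 = spawn_rr(spi_polling,     10);
        pthread_t t3 = spawn_rr(laser_polling,    4);
        pthread_join(t1, nullptr);
        pthread_join(t2, nullptr);
        pthread_join(t3, nullptr);
    }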
Question 1:
My assumption was that IRQ handlers and the callbacks triggered by spi_async in kernel space always have higher priority than any thread running in user space (no matter whether SCHED_RR or SCHED_OTHER). This would mean that switching to SCHED_RR in my application shouldn't slow down the SPI transfer, but that assumption seems to be very wrong. Is it?
Question 2:
How can I determine what is happening here? Which debugging aids exist? (Or maybe you don't need any further information?) The main question for me is: why do I experience the problems only when the laser device is turned on? Could the USB driver really consume that much time?
----- EDIT:
I have made the following observation:
The spi_async callback calls wake_up_interruptible(&mydata->readq); (with wait_queue_head_t readq;). From user space (my app) I call a function which results in poll_wait(file, &mydata->readq, wait);. When the poll returns, user space calls read(). The user-space side looks roughly like the sketch below.
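A minimal sketch of that user-space poll/read loop; the device path is a placeholder, not the module's actual node:

    #include <poll.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <cstdio>

    int main()
    {
        // Placeholder device node for the custom SPI module.
        int fd = open("/dev/myspi0", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[256];
        for (;;) {
            pollfd pfd{fd, POLLIN, 0};
            // Blocks until the kernel module's wake_up_interruptible()
            // marks the wait queue ready (or until the 1 s timeout).
            int ret = poll(&pfd, 1, 1000);
            if (ret < 0) { perror("poll"); break; }
            if (ret == 0) continue;                  // timeout, poll again
            ssize_t n = read(fd, buf, sizeof buf);   // fetch the SPI message
            if (n < 0) { perror("read"); break; }
            // ... hand the message over to the processing thread ...
        }
        close(fd);
    }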
When my application runs with SCHED_OTHER I can see that the callback method finishes before the read() method in my kernel module is entered.
When my application runs with SCHED_RR, read() is entered before the callback has exited.
This seems to prove that the priority of the user-space threads is higher than the priority of the context in which the callback runs. Is there any way to change this behaviour and still keep SCHED_RR for my application's threads?

Not all kernel threads have an RT priority. Imagine a thread that periodically wakes up to do some background work: you don't want it to preempt your RT thread. So I guess your first assumption is wrong.
Based on your other questions:
your main processing loop receives SPI data through a queue
the SPI processing thread feeds the main processing queue
It seems your main processing thread gets in the way of the SPI driver kernel thread responsible for the actual SPI data transfer.
Here is what happens:
an IRQ is fired
spi_async is called, which queues a data transfer that will later be picked up by a kernel thread created by the SPI master driver
the SPI master thread competes with your main processing thread and the laser thread, but this kernel thread has no RT priority, so it loses every time one of your RR threads is runnable
What you can do is go back to normal scheduling while playing with the various CONFIG_PREEMPT_* options. Or mess with the SPI master driver to ensure that any delayed work is queued with enough priority, or even not queued at all. You can also try raising the priority of the SPI master's kernel thread from user space, as sketched below.
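A minimal sketch of that last option, assuming the SPI master driver runs a dedicated kernel thread whose PID you have looked up beforehand (e.g. via ps); the PID and priority below are placeholders:

    #include <sched.h>
    #include <sys/types.h>
    #include <cstdio>

    // Raise a kernel thread (identified by PID) to SCHED_FIFO.
    // Equivalent to: chrt -f -p <prio> <pid>
    static bool make_rt(pid_t pid, int prio)
    {
        sched_param param{};
        param.sched_priority = prio;   // must exceed the app's RR priorities
        if (sched_setscheduler(pid, SCHED_FIFO, &param) != 0) {
            perror("sched_setscheduler");  // needs CAP_SYS_NICE / root
            return false;
        }
        return true;
    }

    int main()
    {
        // Placeholder: PID of the SPI master kernel thread,
        // e.g. found via "ps -ef | grep spi".
        pid_t spi_thread_pid = 123;
        return make_rt(spi_thread_pid, 50) ? 0 : 1;
    }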

Related

Multithreading with very low latencies for calls between threads

I'm starting on new robot control software for our robotic system, which has to read values from multiple independent sensors and control the motors accordingly.
The software will be running on an i5 PC with PREEMPT_RT Ubuntu. Each sensor device comes with an SDK which I want to run in a separate thread. As soon as the threads get new values from their sensors (up to 50 doubles at once from one sensor) they should update those values in a superior control thread. The update rate depends on the sensor, but will be as fast as 1 kHz. As soon as the "main sensor" has got new values and sent them to the control thread, the main sensor's thread should trigger the control loop in the control thread (with a non-blocking call). The control thread should then compute new values for the motors from the currently stored sensor values and transmit them to the motors. The main sensor that triggers the control loop also gets new data at a 1 kHz rate, so the control loop must run that often.
I'm unsure how to approach this. Do you think the thread functionality from C++11 can already solve this? Or should I use something like pthreads or Boost?
The main requirements are super low latency (~ 10 µs) to send data (up to 50 doubles) from one thread to another, and the ability to trigger a function (non-blocking) in another thread.
As soon as the sensor threads have sent the current data to the control thread, they should continue monitoring the hardware to check for new sensor values and retrieve them. Some of the sensor threads perform extra computations and filtering on the sensor data; that's why I want them to run in separate threads, thus taking advantage of the quad-core processor.
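The question has no posted answer here, so the following is only an illustration: on Linux, std::thread is a thin wrapper over pthreads, and a mutex-protected copy of 50 doubles plus a condition-variable wake-up typically costs a few microseconds on a PREEMPT_RT kernel, inside the 10 µs budget. All names and the structure below are hypothetical:

    #include <array>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <thread>

    using Sample = std::array<double, 50>;   // up to 50 doubles per update

    std::mutex m;
    std::condition_variable cv;
    Sample latest{};             // most recent main-sensor values
    uint64_t seq = 0;            // bumped on every update, triggers the loop

    // Hypothetical main-sensor thread: publishes values and triggers
    // the control loop without blocking on it.
    void main_sensor_thread() {
        for (;;) {
            Sample s{};          // ... fill from the sensor SDK at 1 kHz ...
            {
                std::lock_guard<std::mutex> lk(m);
                latest = s;
                ++seq;
            }
            cv.notify_one();     // non-blocking trigger, returns immediately
        }
    }

    void control_thread() {
        uint64_t seen = 0;
        for (;;) {
            Sample s;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return seq != seen; });
                seen = seq;
                s = latest;      // copy out, keep the critical section tiny
            }
            // ... compute and transmit motor commands from s ...
        }
    }

    int main() {
        std::thread t1(main_sensor_thread), t2(control_thread);
        t1.join(); t2.join();
    }

If the mutex ever shows up in measurements, the same shape can be kept while swapping the shared Sample for a lock-free triple buffer.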

DPDK - interrupts rather than polling

Is it possible to configure DPDK so that the NIC sends an interrupt whenever a packet is received (rather than turning off interrupts and having the core poll on the RX queue)? I know this seems counterintuitive but there is a use case I have in mind that could benefit from this.
DPDK claims to allow you to use interrupts for RX queues (you can call rte_eth_dev_rx_intr_enable and pass a port/queue pair as arguments), but upon digging through the code, it seems that this is misleading. There is a polling thread that calls epoll_wait, and upon receipt of a packet, calls eal_intr_process_interrupts. This function then goes through a list of callback functions (which are supposed to be the interrupt handlers) and executes each one. The function then calls epoll_wait again (i.e. it is in an infinite loop).
Is my understanding of how DPDK handles "interrupts" correct? In other words, even if you turn "interrupts" on, DPDK is really just polling in the background and then executing callback functions (so there are no interrupts)?
Is my understanding of how DPDK handles "interrupts" correct?
DPDK is a user space application. Unfortunately, there is no magic way to receive an interrupt callback directly in a user space application.
So NIC interrupts get serviced in the kernel anyway; the kernel then notifies user space using an eventfd. A user space thread waits for the eventfd notification using epoll_wait.
In other words, even if you turn "interrupts" on, DPDK is really just polling in the background and then executing callback functions (so there are no interrupts)?
If there is no data to receive, the DPDK polling thread should block on epoll_wait rather than spin. A sketch of the usual enable/wait cycle is below.
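For reference, a sketch of that enable/wait cycle as application code, modelled on DPDK's l3fwd-power example; setup and error handling are omitted, and it assumes the port was configured with intr_conf.rxq = 1:

    #include <rte_ethdev.h>
    #include <rte_interrupts.h>

    static void rx_loop(uint16_t port, uint16_t queue)
    {
        // Register this queue's interrupt with the per-thread epoll instance.
        rte_eth_dev_rx_intr_ctl_q(port, queue, RTE_EPOLL_PER_THREAD,
                                  RTE_INTR_EVENT_ADD, nullptr);
        rte_mbuf* pkts[32];
        for (;;) {
            uint16_t n = rte_eth_rx_burst(port, queue, pkts, 32);
            if (n > 0) {
                // ... process n packets, free the mbufs ...
                continue;                       // stay in poll mode while busy
            }
            // Queue empty: re-arm the interrupt and sleep until the kernel
            // signals the eventfd (this is where the thread actually blocks).
            rte_eth_dev_rx_intr_enable(port, queue);
            rte_epoll_event ev;
            rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1 /* no timeout */);
            rte_eth_dev_rx_intr_disable(port, queue);
        }
    }

A production loop usually polls the queue once more after re-arming the interrupt, to close the race where a packet arrives between the empty poll and the enable call.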

How to know if system has just woken up from a mem sleep?

I have a Qt application that runs on Linux.
The user can switch the system to mem sleep using this application.
Switching to mem sleep is trivial, but catching the wake up event in user space isn't.
My current solution is to use an infinite loop to detect the wake-up, so that when the system wakes up, my application always continues from a predictable point.
Here is my code:
// needs <fcntl.h> and <unistd.h> for open/write/close, <QTime> for QTime
void MainWindow::memSleep()
{
    int fd = ::open("/sys/power/state", O_RDWR);    // see update 1)
    if (fd < 0)
        return;
    QTime start = QTime::currentTime();
    ::write(fd, "mem", 3);                  // command that triggers mem sleep
    while (true) {
        usleep(5000);                       // delay 5 ms
        const QTime end = QTime::currentTime();  // check the system clock
        if (start.msecsTo(end) > 5 * 2)     // a gap of more than 10 ms means
            break;                          // this thread was frozen for a while,
                                            // indicating a wake-up after a sleep
        start = end;
    }
    ::close(fd);    // the end of this function marks a wake-up event
}
I described this method in a comment on this question, and it was pointed out that it's not a good solution, which I agree with.
Question: Is there a C API that I can use to catch the wake up event?
Update:
1) what is mem sleep?
https://www.kernel.org/doc/Documentation/power/states.txt
The kernel supports up to four system sleep states generically, although three
of them depend on the platform support code to implement the low-level details
for each state.
The states are represented by strings that can be read or written to the
/sys/power/state file. Those strings may be "mem", "standby", "freeze" and
"disk", where the last one always represents hibernation (Suspend-To-Disk) and
the meaning of the remaining ones depends on the relative_sleep_states command
line argument.
2) why do I want to catch the wake up event?
Because some hardware needs to be reset after a wake-up. A hardware input device generates erroneous input events after the system wakes up, so it has to be disabled before sleep (easy) and re-enabled after wake-up (this question).
This should/could be handled by the driver in the kernel, which I have access to, or fixed in hardware, which my team could do but does not have the time to do. (That is why I, an app developer, need to fix it in user space.)
3) constraints
This is embedded Linux, kernel 2.6.37, arch: arm, march: omap2, distro: arago. It's not as convenient as PC distros for adding packages, nor does it have ACPI. And mem sleep support in kernel 2.6.37 isn't mature at all.
Linux device drivers for PCI devices can optionally handle suspend and resume, which the kernel presumably calls just before the system is suspended and just after it resumes. The PCI entry points are in struct pci_driver.
You could write and install a trivial device driver which does nothing more than sense resume operations and provide an indication to any interested processes. The simplest might be to support a file read() which returns a single byte whenever a resume is sensed. The program only needs to open the device and leave a thread stuck reading a single character. Whenever the read succeeds, the system just resumed.
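A sketch of the user-space side of that scheme; /dev/resume0 is a hypothetical node that such a trivial driver might expose:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    // Dedicated thread body: blocks in read() until the (hypothetical)
    // driver completes it on resume, then runs the re-init work.
    void wait_for_resume_loop()
    {
        int fd = open("/dev/resume0", O_RDONLY);   // hypothetical device node
        if (fd < 0) { perror("open"); return; }
        char byte;
        while (read(fd, &byte, 1) == 1) {
            // read() returned: the system just resumed.
            // ... re-initialize / re-enable the input hardware here ...
        }
        close(fd);
    }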
More to the point, if the devices your application is handling have device drivers, the drivers should be updated to react appropriately to a resume.
When the system wakes from sleep, it should generate an ACPI event, so acpid should let you detect and handle that: via an /etc/acpi/events script, by connecting to /var/run/acpid.socket, or by using acpi_listen. (acpi_listen should be an easy way to test if this will work.)
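If acpid turns out to be available, a minimal sketch of the socket route (assuming acpid's default socket path; events arrive as plain text lines):

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, "/var/run/acpid.socket", sizeof(addr.sun_path) - 1);
        if (connect(fd, (sockaddr*)&addr, sizeof addr) != 0) {
            perror("connect");              // acpid not running, or no ACPI
            return 1;
        }
        char buf[256];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf - 1)) > 0) {
            buf[n] = '\0';
            // Each event is a text line, e.g. "button/power PBTN 00000080 ..."
            printf("ACPI event: %s", buf);
            // ... match the resume-related event and re-init hardware ...
        }
        close(fd);
    }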
Check pm-utils, which lets you place a hook in /etc/pm/sleep.d.
In the hook you can deliver a signal to your application, e.g. via kill or any other IPC.
You can also let pm-utils do the suspend itself, which IMO is far more compatible across different configurations.
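On the application side, catching such a signal could look like the sketch below; SIGUSR1 is an arbitrary choice, and the hook script would run something like kill -USR1 <pid> on resume:

    #include <csignal>
    #include <unistd.h>

    // Async-signal-safe flag set by the pm-utils hook's kill -USR1.
    static volatile sig_atomic_t woke_up = 0;

    static void on_sigusr1(int) { woke_up = 1; }

    int main()
    {
        struct sigaction sa{};
        sa.sa_handler = on_sigusr1;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, nullptr);

        for (;;) {
            pause();                 // sleep until any signal arrives
            if (woke_up) {
                woke_up = 0;
                // ... re-initialize / re-enable the input hardware here ...
            }
        }
    }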
EDIT:
I'm not familiar with arago but pm-utils comes with arch and ubuntu.
Also note that, on newer systems that use systemd, pm-utils is obsolete and you should instead put your hooks into systemd.
REF: systemd power events

C++ Qt fast timing of asynchronous processes advice

I'm currently dealing with a Qt GUI I have to set up for a measurement device. The device works with a frame grabber card which grabs images from a line camera really fast. My image processing, which is not that complex, takes 0.2 ms to complete, and it takes about 40 ms to display the signal and the processing result with QCustomPlot, which is totally okay.
Besides the GUI output, the processed signal will also be put out as an analog signal by an NI DAQ device.
My problem is that I have to update the analog signal at a constant frequency and still update the GUI from time to time.
My current approach/idea was to create a data pool thread and two worker threads. One worker thread receives the data from the frame grabber, processes it and updates the data pool. The second worker thread updates the analog channel of the NI DAQ at a certain frequency of about 2-5 kHz, given by a clock in the NI DAQ device.
And the GUI thread would read the data pool from time to time to update the signal display at a rate of about 20-30 Hz.
I wanted to use the Qt thread management and the signal-and-slot mechanism because of their "simplicity" and because I have already worked with threads in combination with Qt and its thread classes.
Is there maybe a better way? Does somebody have an idea or any suggestion? Could I run into problems with the timing of the threads?
Furthermore, is it possible to assign one thread to one single CPU core on a multi-core CPU, so that this core only processes this single thread?
Is there maybe a better way? Does somebody have an idea or any suggestion? Could I run into problems with the timing of the threads?
The signal/slot mechanism is fine; try it, and if you run into performance issues you can still look for another approach. I used the signal/slot mechanism for real-time video processing with QAbstractVideoSurface and Mediaplayer, and it worked for me.
Furthermore, is it possible to assign one thread to one single CPU core on a multi-core CPU, so that this core only processes this single thread?
Why would you do that? The operating system or threading library has a scheduler which takes care of such things. As long as you have no good reason to do it yourself, you should just rely on the existing mechanism.
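That said, if you ever do have a good reason (and a measurement to back it), Linux exposes pinning via pthread_setaffinity_np; a minimal sketch:

    #define _GNU_SOURCE   // for cpu_set_t and pthread_setaffinity_np (glibc)
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Pin the calling thread to a single core (Linux-specific; also works
    // on a std::thread via its native_handle()).
    static bool pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof set, &set) != 0) {
            perror("pthread_setaffinity_np");
            return false;
        }
        return true;
    }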
I would try it with three threads: 1) UI thread, 2) grab-and-process thread, 3) analogue output thread.
The trick is to use a triple buffer to connect the output of grab-and-process to the input of the analogue output.
Say, at moment t, thread (2) finishes processing frame[(t+0)%3]; it immediately switches its output destination to frame[(t+1)%3] and notifies thread (3), which is looping through the data in frame[(t+2)%3], to switch to frame[(t+0)%3] when appropriate.
I used this technique when I was working on an image processing project that had a 10 fps processing frame rate and a 60 fps NTSC output frame rate. To eliminate the tearing effect, a circular buffer with three buffers is the minimum.
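A sketch of that index rotation (my illustration of the scheme described above, not the answerer's code), for a single producer and a single consumer:

    #include <atomic>
    #include <cstdint>

    // Triple buffer: the producer always writes one slot, the consumer
    // always reads another, and the third holds the latest completed frame.
    template <typename Frame>
    class TripleBuffer {
        Frame buf[3];
        // Bits 0-1: write slot, 2-3: ready slot, 4-5: read slot, 6: fresh.
        std::atomic<uint32_t> state{pack(0, 1, 2, false)};

        static uint32_t pack(int w, int rd, int r, bool fresh) {
            return w | (rd << 2) | (r << 4) | (fresh ? 1u << 6 : 0);
        }
    public:
        Frame& write_slot() { return buf[state.load() & 3]; }

        // Producer: publish the frame just written, grab a new write slot.
        void publish() {
            uint32_t s = state.load(), n;
            do {   // swap write and ready slots, mark fresh
                n = pack((s >> 2) & 3, s & 3, (s >> 4) & 3, true);
            } while (!state.compare_exchange_weak(s, n));
        }

        // Consumer: if a fresh frame exists, swap it in; returns the slot.
        Frame& read_slot() {
            uint32_t s = state.load(), n;
            do {
                if (!(s & (1u << 6))) break;   // nothing new, keep old slot
                n = pack(s & 3, (s >> 4) & 3, (s >> 2) & 3, false);
            } while (!state.compare_exchange_weak(s, n));
            return buf[(state.load() >> 4) & 3];
        }
    };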

Help needed for implementing a thread Monitoring Mechanism

I am working on a multithreaded middleware environment. The framework is basically a capturing and streaming framework, so it involves a number of threads.
To give you all a brief idea of the threading architecture:
There are separate threads for the demultiplexer, receiveVideo, DecodeVideo, DisplayVideo etc. Each thread performs its own function, e.g.:
the demultiplexer extracts audio and video packets
receiveVideo receives the header + payload of a video packet & extracts the payload
DecodeVideo receives the payload & decodes it
DisplayVideo receives the decoded packets & shows them on the display
Thus each thread feeds the extracted data to the next thread. The threads share data buffers among themselves, and the buffers are synchronised through the use of mutexes and semaphores. Similarly, there are other threads for handling analog video, analog audio etc.
All the threads are spawned during initialization but remain blocked on a semaphore; depending on the input (analog/digital), selected semaphores are signalled so that specific threads get unblocked & move on to do their work. At various stages each thread makes lower-level (driver) calls to get or write data etc. These calls block, and the errors resulting from them (driver returning corrupted data, driver stalling) should be handled but are not being handled currently.
I want to implement a thread monitoring mechanism where a monitor thread watches these worker threads and, if an error condition occurs, takes some preventive action. As I understand it, such mechanisms are commonly used, like watchdogs, in UI or MMI applications; I am trying to look for something similar.
I am using pthreads and no Boost or STL (it's legacy code, pretty much procedural C++).
Any ideas about specific frameworks, design patterns or open-source projects which do something similar and might help with implementing my requirement?
Can you ping the threads - periodically send each one a message on its usual input queue, interleaved with all the other normal stuff, asking it to return its status? When each handler thread gets the message, it loads the message with status info - how many messages it has processed since the last ping, the length of its input/output queues, the last time its driver returned OK, that sort of stats - and queues it back to your Thread Monitoring Mechanism. Your TMM would have to time out the replies in case some thread/s is/are stuck.
You could, maybe, just post one message down the whole chain, each thread adding its own status in different fields. That would mean only one timeout, after which your TMM would have to examine the message to see how far down the chain it got.
There are other things - I like to keep an on-screen dump, on a 1 s timer, of the length of the queues and the depth of the buffer pools. If something gets stuck, I can usually tell roughly where it is (e.g. a pool is emptying and some queue is growing - the queue consumer is stuck).
Rgds,
Martin
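A simplified sketch of the ping idea above (my illustration, not the answer's code): instead of a full message round-trip, each worker publishes a heartbeat counter plus a few stats that the monitoring thread samples and times out. Plain pthreads, no STL, to match the constraints in the question:

    #include <pthread.h>
    #include <unistd.h>
    #include <time.h>
    #include <stdio.h>

    #define NUM_WORKERS 4

    struct WorkerStatus {
        pthread_mutex_t lock;           /* init with pthread_mutex_init at startup */
        unsigned long   heartbeat;      /* bumped once per processed message */
        unsigned long   queue_depth;    /* stats filled in by the worker     */
        time_t          last_driver_ok;
    };

    static WorkerStatus g_status[NUM_WORKERS];

    /* Called by each worker from its main loop after every unit of work. */
    void worker_report(int id, unsigned long depth, int driver_ok)
    {
        pthread_mutex_lock(&g_status[id].lock);
        g_status[id].heartbeat++;
        g_status[id].queue_depth = depth;
        if (driver_ok)
            g_status[id].last_driver_ok = time(NULL);
        pthread_mutex_unlock(&g_status[id].lock);
    }

    /* Monitoring thread: a worker whose heartbeat has not moved within
     * the timeout is considered stuck (blocked in a driver call, etc.). */
    void* monitor_thread(void*)
    {
        unsigned long last_seen[NUM_WORKERS] = {0};
        for (;;) {
            sleep(2);   /* timeout period */
            for (int i = 0; i < NUM_WORKERS; i++) {
                pthread_mutex_lock(&g_status[i].lock);
                unsigned long hb = g_status[i].heartbeat;
                pthread_mutex_unlock(&g_status[i].lock);
                if (hb == last_seen[i])
                    fprintf(stderr, "worker %d stalled!\n", i);  /* recover here */
                last_seen[i] = hb;
            }
        }
        return NULL;
    }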
What about using a signalling system to wake up your monitoring thread when something has gone awry in one of your worker threads? You can emulate the signalling with an event of some kind (with pthreads, a condition variable plays the role of a resettable event).
When an exception occurs in your worker thread, you fill a data structure with the data about the exception and pass it on to your monitoring thread, then wake the monitoring thread using the event.
The monitoring thread can then do whatever you need it to do.
I'm guessing you don't wish to have your monitoring thread active unless something has gone wrong, right?
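A sketch of that event pattern with pthreads (names are hypothetical; a condition variable stands in for the reset event):

    #include <pthread.h>
    #include <string.h>

    /* Filled in by the failing worker before it signals the monitor. */
    struct ErrorReport {
        int  worker_id;
        char description[128];
    };

    static pthread_mutex_t g_err_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  g_err_cond = PTHREAD_COND_INITIALIZER;
    static ErrorReport     g_err;
    static int             g_err_pending = 0;

    /* Worker side: record the error and wake the monitor. */
    void report_error(int id, const char* what)
    {
        pthread_mutex_lock(&g_err_lock);
        g_err.worker_id = id;
        strncpy(g_err.description, what, sizeof(g_err.description) - 1);
        g_err_pending = 1;
        pthread_cond_signal(&g_err_cond);
        pthread_mutex_unlock(&g_err_lock);
    }

    /* Monitor side: sleeps until a worker reports something. */
    void* monitor_thread(void*)
    {
        for (;;) {
            pthread_mutex_lock(&g_err_lock);
            while (!g_err_pending)                  /* no busy-waiting */
                pthread_cond_wait(&g_err_cond, &g_err_lock);
            ErrorReport report = g_err;             /* copy out */
            g_err_pending = 0;
            pthread_mutex_unlock(&g_err_lock);
            /* ... take corrective action based on report ... */
            (void)report;
        }
        return 0;
    }

This keeps the monitoring thread completely idle until a worker actually reports a problem, which matches the "only active when something has gone wrong" goal.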