How to test Interrupt Latency?

Windows Embedded Compact 7.
Is there a way to test interrupt latency time from user space?
Are there any tools provided as part of platform builder?
I also saw a program called Intrtime.exe - but no examples on how to use it.
How does one test the interrupt latency time?
Reference for Intrtime.exe but how do I implement it?
http://www.ece.ufrgs.br/~cpereira/temporeal_pos/www/WindowsCE2RT.htm
EDIT
Also found:
ILTiming.exe Real-Time Measurement Tool (Compact 2013)
http://msdn.microsoft.com/en-us/library/ee483144.aspx

This really is a test that requires hardware, and there are a couple of "latencies" you might measure. One is the time from the interrupt signal to when the driver ISR reacts, and the second is the time from when the interrupt occurs to when an IST reacts.
I did this back in the CE 3.0/CE 4.0 days by attaching a signal generator to an interruptible input and then having an ISR pulse a second line and an IST pulse a third line when they received the interrupt. I hooked a scope up to the input and outputs and used it to measure the time between the input signal and the output signals to get not just latency, but also jitter. You could easily add a 4th line for CE 7 so you could check an IST in user space and an IST in kernel space. I'd definitely be interested to see the results.
I don't think you can effectively measure this with software running on the platform, as you get into the problem of the measuring code affecting the results. You're also talking about times way, way below the system tick resolution, so the scheduler is going to be problematic as well. CeLog might be able to give you an idea of these times, but getting it set up and running is probably more work than just hooking up a scope.
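Once the scope or logic analyser has captured edge times, turning them into latency and jitter figures is simple arithmetic. A minimal sketch, assuming the capture has been exported as two lists of timestamps in microseconds (one for the stimulus edge, one for the ISR or IST response edge); the names LatencyStats and summarize_latency are made up for illustration, and jitter is taken here as the peak-to-peak spread:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Summary of a set of latency samples, all in microseconds.
struct LatencyStats {
    double min_us;
    double max_us;
    double mean_us;
    double jitter_us;  // peak-to-peak spread: max - min
};

// stimulus_us[i]: time the interrupt line was asserted (assumed input).
// response_us[i]: time the ISR or IST toggled its output line (assumed input).
LatencyStats summarize_latency(const std::vector<double>& stimulus_us,
                               const std::vector<double>& response_us) {
    assert(!stimulus_us.empty() && stimulus_us.size() == response_us.size());
    double lo = response_us[0] - stimulus_us[0];
    double hi = lo;
    double sum = 0.0;
    for (std::size_t i = 0; i < stimulus_us.size(); ++i) {
        const double latency = response_us[i] - stimulus_us[i];
        lo = std::min(lo, latency);
        hi = std::max(hi, latency);
        sum += latency;
    }
    return {lo, hi, sum / static_cast<double>(stimulus_us.size()), hi - lo};
}
```

Running it once on the ISR line and once on the IST line gives the two latencies separately.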

What is usually meant by interrupt latency is the time between an interrupt source asserting the interrupt line and a thread (sometimes in user-space) being scheduled and then executing as a result.
Unless your CPU has some accurate way of time-stamping interrupt events as they arrive at the CPU (rather than when an ISR runs), the only truly accurate measurement is one done externally, by measuring the time between the interrupt line being asserted and some observable signal that the thread responding to the interrupt can control. A DSO or logic analyser is usually used for this purpose.
Software techniques usually rely on storing an accurate time-stamp at the earliest opportunity in an ISR. If you're certain the time between interrupt line becoming asserted and the ISR running is negligible, this might be valid. If, on the other hand, disabling of interrupts is being used to control concurrency, or interrupts are nested, you probably want to be measuring this as well.

Related

Adapting program from single to multicore

I am considering a programming project that will run under Ubuntu or another Linux OS on a small board with a quad-core x86 (N-series Pentium). The software generates 8 fast signals: square-wave pulse trains for stepper motor motion control of 4 axes. Step signals are 50-100 kHz maximum, but usually slower. I want to avoid jitter in these stepping signals (call it good fidelity), so around 1-2 us for each thread loop cycle would be a nice target. The program does other kinds of tasks, like reading/writing the hard drive, Ethernet, continuous updates of the graphics display, and keyboard input. The existing single-core programs just cannot process motion signals with this kind of timing and require external hardware/techniques to achieve this.
I have been reading other posts here, like one on a thread running continuously on a selected core. The exact meaning in these posts is "loose"; I am not sure what is really meant. Continuous might mean testing every minute or ?????
So, I might be wordy, but it will be clear, I hope. The program under consideration has all the threads, routines, memory, and shared memory included. I am not considering having this program launch another program or service. Other threads are written in this program and launched when the program starts up. I will call this signal-generating thread the FAST THREAD.
The FAST THREAD is to be launched on an otherwise "free" core. It needs to be the only thread that runs on that core. Hopefully, the OS thread scheduler component on that core can be "turned off", so that it does not even interrupt on that core to decide what thread runs next. Looking at the processor manual, each core has a counter timer chip. Is it possible, then, that I can use it to provide a continuous train of interrupts into my "locked in" FAST THREAD for timing purposes? This is in the range of about 1-2 us. If not, then just reading one channel on that CTC to provide software sync. This fast thread will, therefore, see (experience) no delays from the interrupts issued in the other cores and associated multicore fabric. This FAST THREAD, when running, will continue to run until the program closes. This could be hours.
Input data to drive this FAST THREAD will be common shared memory defined in the program. There are also hardware signals for motion limits (from GPIOs or SDI port). If any go TRUE, that forces a programmed halt of all motion. It does not need a 1~2 us response. It could go to a slower motion loop.
Ah, the output:
Some motion data is written back to the shared memory (assigned for this purpose), like the current location and the current loop number.
Some signals need to be output (the 8 outputs). There are numerous free GPIOs. I am not sure of the path taken to get the signaled GPIO pin to change the output. The system call to Linux initiates the pin change event. There is also an SDI port available, running with up to a 25 MHz clock. It seems these ports (GPIO, UART, USB, SDI) exist in the fabric that is not on any specific core. I am not sure of the latency from the issuance of these signals in the program until the associated external pin actually presents that signal. In the fast thread, even 10 us would be OK, if it was always the same latency! I know that will not be so; there will be jitter. I need to think on this spec.
There will possibly be a second dedicated core (similar to above) for slower motion planning. That leaves two cores for everything else. Since all the other items (SATA, video screen, keyboard ...) are already able to work on a single core, the remaining two cores should be great.
At close of program, the FAST THREAD returns the CTC and any other device on its core back to "as it was", re-enables the OS components in this core to their more normal operation. End of thread.
Concluding: I have described the overall program, so as for you to understand what I want to do with this FAST THREAD running, how responsive it needs to be, and that it needs to be undisturbed!! This processor runs in the 1.5 ~ 2.0 GHz range. It certainly can do the repeated calculations in the required time frame.
DESIRED: I do not know the system calls that would allow me to use a selected x86 core in this way. Any pointers would be helpful, as would any manual or document that describes these calls/procedures.
Can this use of a core also be done in Windows (7, 10)?
Thanks for reading and any pointers you have.
Stan
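For the Linux part of the question, one building block is thread-to-core affinity. A minimal sketch, assuming glibc on Linux: sched_setaffinity with a pid of 0 restricts the calling thread to the CPUs in the given mask. The helper names pin_current_thread_to_cpu and first_allowed_cpu are made up for illustration. Pinning alone does not keep the kernel from running other threads or timer ticks on that core; that is usually approached separately (for example with the isolcpus kernel boot parameter and a real-time policy such as SCHED_FIFO) and is not shown here.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros; g++ normally defines it already
#endif
#include <cassert>
#include <sched.h>   // sched_setaffinity, sched_getaffinity, cpu_set_t

// Restrict the calling thread to a single CPU. Returns true on success.
// Hypothetical helper name, not part of any library.
bool pin_current_thread_to_cpu(int cpu) {
    assert(cpu >= 0);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Lowest-numbered CPU the calling thread may currently run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}
```

The FAST THREAD would call pin_current_thread_to_cpu with its reserved core number as the first thing it does, before entering its loop.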

How do I measure GPU time on Metal?

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc.? I'm using Objective-C.

There are a couple of problems with this method:
1) You really want to know what is the GPU side latency within a command buffer most of the time, not round trip to CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) If you start the clock on scheduled and stop on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
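The differencing idea from point (1) can be written out as arithmetic. A minimal sketch, assuming you already have two measured times for the same shader run with different instance counts; the function names are made up for illustration, and the fixed cost (submission, round trip) is assumed to be the same in both runs so that it cancels:

```cpp
#include <cassert>

// Estimated cost of one shader instance from two runs with different
// instance counts. Any fixed per-run overhead cancels in the subtraction.
double per_instance_cost(double time_small, int n_small,
                         double time_large, int n_large) {
    assert(n_large > n_small);
    return (time_large - time_small) / static_cast<double>(n_large - n_small);
}

// Fixed per-run overhead implied by the smaller run.
double fixed_overhead(double time_small, int n_small, double per_instance) {
    return time_small - static_cast<double>(n_small) * per_instance;
}
```

For example, 12 ms for 10 instances and 22 ms for 20 instances gives 1 ms per instance and 2 ms of fixed overhead. As noted in (1), the error of the result is the sum of the errors of the two measurements.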

You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to have that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.

As an addition to Ken's comment above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc

Using timers with performance-critical software (Qt)

I am developing an application that is responsible for moving and managing robots over a UDP connection.
The application needs to:
Read joystick/user input using SDL.
Generate and send a control packet to the robot every 20 milliseconds (UDP)
Receive and decode response packets from the robot (~20 msecs). This was implemented with the signal/slot mechanism and does not require a timer.
Receive and process robot messages for debugging reasons. This is not time-regulated.
Update the UI regularly to keep the user notified about the status of the robot (e.g. battery voltage). For most cases, I have also used Qt's signal/slot mechanism.
Use a watchdog that disables the robot if no response is received after 1 second. The watchdog is reset when the application receives a robot packet (~20 msecs)
For the moment, I have implemented all of the above. However, the application fails to send the packets regularly when the watchdog is activated or when two or more QTimer objects are used. The application would generally work, but I would not consider it "production ready". I have tried to use the precision flags of the timers (Qt::Precise, Qt::Coarse and Qt::VeryCoarse), but I still experienced problems.
Notes:
The code is generally well organized, there are no "god objects" in the code base (most source files are less than 150 lines long and only create the necessary dependencies).
Most of the time, I use QTimer::singleShot() (e.g. I will only send the next packet once the current packet has been sent).
Where we use timers:
To read joystick input (~50 msecs, precise timer)
To send robot packets (~20 msecs, precise timer)
To update some aspects of the UI (~500 msecs, coarse timer)
To update the elapsed time since the robot was enabled (~100 msecs, precise timer)
To implement a watchdog (put the application and robot in safe state if 1000 msecs have passed without a robot response)
Note: the watchdog is fed when we receive a response packet from the robot (~20 msecs)
Do you have any recommendations for using QTimer objects with performance-critical code? (Any idea is welcome.) Note that I have also tried to use different threads, but it has caused me more problems, since the application would not be in "sync", thus failing to effectively control the robots that we have tested.

Actually, I seem to have underestimated Qt's timer and event loop performance. On my system I get on average around 20k nanoseconds for an event loop cycle plus the overhead from scheduling a queued function call, and a timer with interval 1 millisecond is rarely late, most of the timeouts are a few thousand nanoseconds short of a millisecond. But it is a high end system, on embedded hardware it may be a lot worse.
You should take the time and profile your target system and Qt build to determine whether it can indeed run snappy enough, and based on those measurements, adjust your timings to compensate for the system delays to get your events scheduled more on time.
You should definitely keep the timer thread as free as possible, because if you block it by IO or extensive computation, your timer will not be accurate. Use a dedicated thread to schedule work and extra worker threads to do the actual work. You may also try playing with thread priorities a bit.
Worst case scenario, look for 3rd party high performance event loop implementations or create your own, and potentially also a faster signaling mechanism as well. As I already mentioned in the comments, Qt's inter-thread queued signals are very slow, at least compared to something like indirect function calls.
Last but not least, if you want to do task X every N units of time, it will only be possible if task X takes N units of time or less on your system. You need to make this consideration for each task, and for all tasks running concurrently. And in order to get accurate scheduling, you should measure how long task X took and, if that is less than its period, schedule the next execution in the time remaining; otherwise execute immediately.
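The last point can be sketched without any Qt at all. This is a minimal illustration using std::chrono, not code from the application in question; next_delay and run_and_plan are made-up names:

```cpp
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Delay before the next run of a periodic task: the remainder of the
// period if the task finished early, zero if it overran.
milliseconds next_delay(milliseconds period, milliseconds elapsed) {
    assert(period.count() > 0);
    return elapsed < period ? period - elapsed : milliseconds{0};
}

// Run the task once, measure how long it took, and return the delay
// to use when scheduling the next run.
template <typename Task>
milliseconds run_and_plan(Task task, milliseconds period) {
    const auto start = std::chrono::steady_clock::now();
    task();
    const auto elapsed = std::chrono::duration_cast<milliseconds>(
        std::chrono::steady_clock::now() - start);
    return next_delay(period, elapsed);
}
```

In a Qt application the returned value could be passed as the interval to QTimer::singleShot() for the next packet, so that a 20 ms period stays 20 ms even when sending takes a few milliseconds.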

Identify the reason for a 200 ms freeze in a time critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could interfere too with the software, I need a tool/technique to test what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread and one buffer is used to read the data and save it to disc in the second thread. If all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffer switches, the data collection thread blocks for about 200 ms, which is far too long. However, it happens once in a while that the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz)).
My guess is, that the performance problem is linked to the data being exchanged between cores. Only if the threads run on the same core by chance, the exchange would be much faster. I thought about setting the thread affinity mask in a way to force both threads to run on the same core, but this also means, that I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer to switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course during the toggling all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms are very reproducible for each switch event.

Locking and releasing a mutex just to switch one bool variable will not take 200ms.
Main problem is probably that two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically, this occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism, you have two threads waiting for each other to finish their part of the work, which has a similar effect to a single-threaded approach.
For further reading, I recommend this article, which describes lock contention in more detail.
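To illustrate why the toggle itself should be cheap, here is a minimal sketch of a double buffer along the lines described in the question. It is not the asker's code: the class name, the int payload, and plain vectors instead of ring buffers are simplifying assumptions. The lock is held only for a push or for the index toggle and a vector swap, so contention of this kind should cost microseconds, not 200 ms:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical double buffer: the writer appends to the active buffer,
// the reader toggles the active index and takes the filled buffer.
class DoubleBuffer {
public:
    // Called by the collecting thread.
    void push(int item) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[write_index_].push_back(item);
    }

    // Called by the saving thread: switch buffers and take the filled one.
    // Only the toggle and a constant-time swap happen under the lock.
    std::vector<int> swap_and_take() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(write_index_ < 2);
        const std::size_t filled = write_index_;
        write_index_ = 1 - write_index_;
        std::vector<int> out;
        out.swap(buffers_[filled]);
        return out;
    }

private:
    std::mutex mutex_;
    std::vector<int> buffers_[2];
    std::size_t write_index_ = 0;
};
```

Slow work such as writing to disc or compressing happens on the returned vector, outside the lock, so it cannot block the collecting thread.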

Since you are running on Windows, maybe you use Visual Studio? If yes, I would resort to the VS profiler, which is quite good (IMHO) in such cases, as long as you don't need to check data/instruction caches (in that case Intel VTune is the natural choice). From my experience, VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool. You don't need VS installed on your test machine; you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
Then you can open the file and inspect it in Visual Studio. If you don't see anything keeping your CPU busy, switch to contention profiling: just change the "start" argument from "SAMPLE" to "CONCURRENCY".
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\, AFAIR it is available from VS2010
Good luck

After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified. Since it saves data to disc, it is the one with file access. (Look at Memory->Hard Faults)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port io command could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch data command sent via the virtual serial port pair sometimes gets lost. This might be linked to very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received, and the first as well as the second attempt to receive the data time out after their 100 ms timeout.

Can I set a single thread's priority above 15 for a normal priority process?

I have a data acquisition application running on Windows 7, using VC2010 in C++. One thread is a heartbeat which sends out a change every .2 seconds to keep-alive some hardware which has a timeout of about .9 seconds. Typically the heartbeat call takes 10-20ms and the thread spends the rest of the time sleeping.
Occasionally however there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL which is 15 for a normal priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slowdown, but other threads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: So I hadn't actually tried passing 31 into my AfxBeginThread call and turns out it ignores that value and sets the thread to normal priority instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: Turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. Next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio")
Update: Scheduling heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24) but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter so turned off the weekly scan. Still can't explain some delays so we're going to increase to a 5-10 second watchdog timeout.

Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc., etc. It will take some insight to selectively disable or use an alternate driver to see if the problem system hesitation is affected. Once you find that, then figuring out a way to prevent it might range from trivial to impossible depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?

Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unofficially, you could play with the undocumented NtSetInformationThread
see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as was said before, you can never be sure that the OS will not take its time when your thread's quantum expires. Certain poorly written drivers are often the cause of such latency.
Otherwise, there is software that can tell you if you have misbehaving kernel parts:
http://www.thesycon.de/deu/latency_check.shtml
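On the first point, the reason 15 is the ceiling outside the realtime class follows from how Windows combines the process priority class with the thread priority level. A minimal sketch of that mapping as I understand the documented priority table; the enums and base_priority are illustrative stand-ins, not the Win32 constants or any API, and the realtime class's extra levels (-7 to -3 and 3 to 6) are left out:

```cpp
#include <cassert>

enum class PriorityClass { Idle, BelowNormal, Normal, AboveNormal, High, Realtime };
enum class ThreadLevel { Idle, Lowest, BelowNormal, Normal, AboveNormal, Highest, TimeCritical };

// Base priority (1..31) for a given class and thread level.
int base_priority(PriorityClass cls, ThreadLevel level) {
    const bool realtime = (cls == PriorityClass::Realtime);
    // The two extreme levels are absolute values, not offsets.
    if (level == ThreadLevel::Idle) return realtime ? 16 : 1;
    if (level == ThreadLevel::TimeCritical) return realtime ? 31 : 15;

    int class_base = 8;
    switch (cls) {
        case PriorityClass::Idle:        class_base = 4;  break;
        case PriorityClass::BelowNormal: class_base = 6;  break;
        case PriorityClass::Normal:      class_base = 8;  break;
        case PriorityClass::AboveNormal: class_base = 10; break;
        case PriorityClass::High:        class_base = 13; break;
        case PriorityClass::Realtime:    class_base = 24; break;
    }
    int offset = 0;
    switch (level) {
        case ThreadLevel::Lowest:      offset = -2; break;
        case ThreadLevel::BelowNormal: offset = -1; break;
        case ThreadLevel::AboveNormal: offset = 1;  break;
        case ThreadLevel::Highest:     offset = 2;  break;
        default:                       offset = 0;  break;
    }
    const int priority = class_base + offset;
    assert(priority >= 1 && priority <= 31);
    return priority;
}
```

This matches what the question observed: THREAD_PRIORITY_TIME_CRITICAL gives 15 in a normal priority process and 31 only under REALTIME_PRIORITY_CLASS.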

I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.
