Time short functions with CPU time using the RTEMS operating system - C++

I am looking to profile some code on a real-time operating system, RTEMS. Essentially, RTEMS has a number of functions to read the time, the most useful of which is rtems_clock_get_ticks_since_boot.
The problem here is that, for whatever reason, the reported clock ticks are synchronized with our state machine loop rate of 5 kHz, whereas the processor is running at around 200 MHz (embedded system). I know this because I recorded the clock time, waited 1 second, and only 5000 ticks had gone by.
So the question is:
How can I get the actual CPU ticks from RTEMS?
PS.
clock() from GNU C has the same problem.
There is a guide that I have been looking into here, but I get an "impossible constraint in asm" error, which indicates that I would need to use different assembler keywords. Maybe someone can point me to something similar?
Context
I want to profile some code, so essentially:
start = cpu_clock_ticks()
//Some code
time = cpu_clock_ticks() - start;
The code runs in less than 0.125 ms, so the 8 kHz counter that clock() and the other RTEMS functions use won't cut it.

Accurate performance measurements can be made using an oscilloscope, provided that there is a GPIO, test point or pin that the software can write to (and the oscilloscope probe can attach to).
The method here is to send a pulse to the pin. The o'scope can be set up to trigger on the pulse. Some smarter o'scopes can perform statistics on the pulse width, such as mean time and maximum time.
On our embedded system, the H/W team was nice enough to bring out 8 test points for us to use. We initialize the pin to zero. At the start of the code to profile, we write a 1 to the pin. At the end of the profiling code, we write a 0 to the pin. This produces a pulse or square wave.
The o'scope is set up to trigger on the rising edge. The probe is connected to the pin and the program is run. Adjust the o'scope so that the entire pulse is visible on the screen. Re-run the program. When the o'scope triggers, measure the width of the pulse. This is the actual execution time.
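A rough sketch of that approach (gpio_write() and TEST_POINT_0 are placeholders for whatever register write or BSP call actually drives the test point on your board, and code_to_profile() is the code under test):

// gpio_write() and TEST_POINT_0 are placeholders for your board's
// register write or BSP call; code_to_profile() is the code under test.
gpio_write(TEST_POINT_0, 0);   // initialize the pin low

gpio_write(TEST_POINT_0, 1);   // rising edge: the o'scope triggers here
code_to_profile();             // the code being measured
gpio_write(TEST_POINT_0, 0);   // falling edge: pulse width = execution time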

So a solution to this is to use the following function:
inline unsigned long timer_now() {
    unsigned long time;
    // The internal time base is read as special purpose register 268.
    // (24.576 MHz => 1 tick = 4.069010416E-8 s, i.e. ~0.04 µs)
    asm volatile ("mfspr %0,268; sync" : "=r" (time));
    return time;
}
timer_now will return ticks that are still not at the processor speed, but much faster than 8 kHz; the time taken can then be calculated as ticks * 0.04 µs.
NOTE: This may only work for the PowerPC MPC5200 BSP for RTEMS, since it uses an assembler routine.
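A usage sketch for timer_now (the 0.04 µs per tick scale factor comes from the 24.576 MHz time base noted in the comment above):

unsigned long start = timer_now();
// Some code to profile
unsigned long ticks = timer_now() - start;   // unsigned math also copes with wrap-around
double elapsed_us = ticks * 0.04069010416;   // 1 tick = 1 / 24.576 MHz ~= 0.04 µs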

In RTEMS 4.11 or newer you can use rtems_counter_read to obtain a high-precision counter that abstracts away the CPU-specific assembly code. Please see: https://docs.rtems.org/doxygen/cpukit/html/group__ClassicCounter.html
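A minimal sketch of that, assuming the counter API declared in <rtems/counter.h> (rtems_counter_read, rtems_counter_difference and rtems_counter_ticks_to_nanoseconds):

#include <rtems/counter.h>
#include <stdint.h>

// Sketch: time a short code section with the free-running CPU counter.
// rtems_counter_difference() accounts for counter wrap-around.
uint64_t profile_section_ns(void)
{
    rtems_counter_ticks start = rtems_counter_read();
    // Some code to profile
    rtems_counter_ticks elapsed = rtems_counter_difference(rtems_counter_read(), start);
    return rtems_counter_ticks_to_nanoseconds(elapsed);
}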
RTEMS related questions like this are invariably answered more quickly and accurately when submitted to the subscribe-only users mailing list.

Related

How to offload precise ADC oversampling with a RISC-V GD32VF103CBT6 development board

I'm hoping to work up a very basic audio effects device using a RISC-V GD32VF103CBT6 development board. I have managed to do some hardware-interrupt-based sampling with another MCU, but I'm a bit confused by the documentation for the RISC-V board (Chapter 11 in the user manual); I haven't the slightest idea how to turn the instructions there into actual C/C++ code. Sadly, their GitHub repo has almost no examples at all, and none appear to deal with high-speed sampling. There's also a datasheet in that GitHub repo, but I haven't been able to find any specific code examples or revealing instruction in there, either.
What I want to do is:
Perform the calibration described in the user manual, which must precede sampling operations.
Collect 12-bit samples of the audio signal voltage off an external pin, using the oversampling capability to sum numerous 12-bit samples into a single 16-bit sample at a high sampling rate. Ultimately I want audio sampled at 16 bits, 48 kHz-96 kHz.
I need help instructing the MCU to collect these samples using its built-in hardware features.
I want to continuously sample, offloading as much as possible to built-in hardware functions so I can leave enough processing overhead left to do a bit of signal processing for simple effects.
Section 11.4.1 clearly says
Calibration should be performed before starting A/D conversion.
The calibration is initiated by software by setting bit CLB=1. CLB bit stays at 1 during all the calibration sequence. It is then cleared by hardware as soon as the calibration is completed.
The internal analog calibration can be reset by setting the RSTCLB bit in ADC_CTL1 register.
Calibration software procedure:
1) Ensure that ADCON=1.
2) Delay 14 ADCCLK to wait for ADC stability
3) Set RSTCLB (optional)
4) Set CLB=1.
5) Wait until CLB=0.
Question 1: How do I set these memory registers as these instructions indicate? I need a code example, and the manufacturer provides none.
Question 2: How do I delay 14 ADCCLK in C/C++? A loop seems like it would be enormously inefficient. Should I call sleep()? Any explanation of ADCCLK would also be helpful.
This also seems important, but I have no idea what it portends:
The ADCCLK clock provided by the clock controller is synchronous APB2 clock. The RCU controller has a dedicated programmable prescaler for the ADC clock.
I am not at all certain but I think this is the conversion mode I want:
Continuous conversion mode
This mode can be run on the regular channel group. The continuous conversion mode will be enabled when CTN bit in the ADC_CTL1 register is set. In this mode, the ADC performs conversion on the channel specified in the RSQ0[4:0]. When the ADCON has been set high, the ADC samples and converts specified channel, once the corresponding software trigger or external trigger is active. The conversion data will be stored in the ADC_RDATA register.
Software procedure for continuous conversion on a regular channel. To get rid of checking, DMA can be used to transfer the converted data:
1. Set the CTN and DMA bit in the ADC_CTL1 register
2. Configure RSQ0 with the analog channel number
3. Configure ADC_SAMPTx register
4. Configure ETERC and ETSRC bits in the ADC_CTL1 register if in need
5. Prepare the DMA module to transfer data from the ADC_RDATA.
6. Set the SWRCST bit, or generate an external trigger for the regular group
ADCCLK refers to the input clock of the ADC. Maybe take a look at your datasheet: most µCs have a block diagram of the clock architecture; usually there is a main system clock, and the different peripherals have a programmable prescaler which divides the system clock by some power of 2.
So 14 ADCCLK cycles means it is not 14 CPU cycles but 14 ADC input clock edges.
For example, if the ADC prescaler is set to 64, then you have to wait 64*14 CPU clock cycles.
How to wait at all:
Usually (I do not know if such a thing is present on your device) peripherals have a busy flag that is set as long as the current operation is ongoing, so maybe you can poll this flag (e.g. while (ADC0_FLAGS & ADC_ISBUSY);).
Another option may be to check whether there is an interrupt that signals the completion of your operation. But at least for the calibration, the simplest thing would be to start it and just use a wait or delay function that wastes a bit of time.
I personally would start the calibration at system start-up and then do the other initialization work. Maybe delay a few milliseconds at the end of setup to make sure all components on the board are powered up correctly. By then the ADC calibration should have finished long ago.
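A sketch of the manual's five-step calibration sequence might look like the following. The register and bit macros (ADC_CTL1(ADC0), ADC_CTL1_ADCON, ADC_CTL1_RSTCLB, ADC_CTL1_CLB) are stand-ins for the definitions in the vendor's gd32vf103 headers, not verified names, and the loop is only a crude way to wait at least 14 ADCCLK cycles:

// Stand-in register/bit macros: take the real names from the vendor's
// gd32vf103 headers; they are assumptions here, not verified definitions.
void adc0_calibrate(void)
{
    ADC_CTL1(ADC0) |= ADC_CTL1_ADCON;             // 1) ensure ADCON = 1 (ADC powered on)
    for (volatile int i = 0; i < 1000; i++) { }   // 2) crude delay >= 14 ADCCLK cycles
    ADC_CTL1(ADC0) |= ADC_CTL1_RSTCLB;            // 3) reset calibration (optional)
    ADC_CTL1(ADC0) |= ADC_CTL1_CLB;               // 4) set CLB = 1
    while (ADC_CTL1(ADC0) & ADC_CTL1_CLB) { }     // 5) hardware clears CLB when done
}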

STM32 (using Mbed online) showing delay at higher analog input frequency

I am new to the use of controllers.
I am setting up an STM32F769 controller (using the Mbed online compiler). My target is to get a PWM output which changes its frequency according to an analog input. I did some basic coding, but there is a problem. When I check the output on an oscilloscope with a 1 Hz analog input, it works perfectly, but when I check it with a 100 Hz analog input there is a delay in the output, and I get wrong values. I do not understand why, because this board is fast (216 MHz) and I should not face such an issue. (If someone could also explain: is it possible to run the board at 216 MHz or its other maximum frequency, and how?)
#include "mbed.h"

// The pin names below are assumed; the original post does not show the declarations.
AnalogIn analog_value(A0);
PwmOut pulse(D3);

int main()
{
    float meas_r, meas_v, out_freq;
    while (true) {
        meas_r = 0;
        // Average 1024 ADC readings (read() returns a normalized 0.0 .. 1.0 value)
        for (int i = 1; i <= 1024; i++) {
            meas_r = meas_r + analog_value.read();
        }
        meas_r = meas_r / 1024;
        meas_v = meas_r * 3300;             // scale to millivolts (3.3 V reference)
        out_freq = 50000 + (meas_v * 50);   // map voltage to output frequency
        pulse.period(1.0 / out_freq);
    }
}
It should work with a 100 Hz analog input just as it does at 1 Hz.
216 MHz might be the maximum clock frequency at which your processor can operate; however, that does not mean it can input/output signals at that rate on its ports.
The delays are caused by the time it takes to read the analog values and compute the required math operations. You are using multiple multiplications and divisions, which are more complex than addition and subtraction on almost any hardware. You are also using libraries (pulse.period(), analog_value.read()), so there are hidden computations on top of those multiplications and divisions. Finally, it is possible that your device is doing other work as well (only you know about this). All of those computations need time. At lower frequencies you might not notice the delay, but when the input frequency is high enough, the delays become visible. Also consider the time required to read the analog value 1024 times per update.
The wrong signal and period are due to those delays and some other uncertainties. If the processor is working on other tasks as well, then it is hard to predict how long it takes to finish them all. Since the processor executes instructions one after another and waits for the previous computation to finish before the next one starts, this adds uncertainty to the timing. The data path and the frequency of peripheral devices (getting input from peripherals) play a crucial role in timing uncertainty and delays.
If timing and accuracy are really important for your problem, and you can't solve it with a DSP, MPU, MCU, CPU, GPU, etc., I would suggest using an FPGA.

How do I measure GPU time on Metal?

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments, etc.? I'm using Objective-C.
There are a couple of problems with this method (taking CPU-side timestamps around command buffer execution, as described in the next answer):
1) You really want to know what is the GPU side latency within a command buffer most of the time, not round trip to CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) If you start the clock on scheduled and stop on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to reach that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's answer above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc

How to use the QNX Momentics Application Profiler?

I'd like to profile my (multi-threaded) application in terms of timing. Certain threads are supposed to be re-activated frequently, i.e. a thread executes its main job once every fixed time interval. In other words, there's a fixed time slice in which all the threads are getting re-activated.
More precisely, I expect certain threads to get activated every 2ms (since this is the cycle period). I made some simplified measurements which confirmed the 2ms to be indeed effective.
For the purpose of profiling my app more accurately it seemed suitable to use Momentics' tool "Application Profiler".
However, when I do so, I fail to interpret the timing figures that it reports. I would be interested in the average as well as the min and max time it takes before a certain thread is re-activated. So far it seems the tool only lets me monitor the times certain functions occupy. However, even that does not really seem to be the case. E.g. I've got 2 lines of code that are put literally next to each other:
if (var1 && var2 && var3) var5=1; takes 1ms (avg)
if (var4) var5=0; takes 5ms (avg)
What is that supposed to tell me?
Another thing confuses me: the parent thread "takes" up 33 ms on avg, 2 ms on max and 1 ms on min. Aside from the fact that the avg shouldn't be bigger than the max (in fact I expect the avg to be no bigger than 2 ms, since this is the cycle time), it actually increases the longer I run the profiling tool. So, if I ran the tool for half an hour, the 33 ms would actually be something like 120 s. So, it seems that avg is actually the total amount of time the thread occupies the CPU.
If that is the case, I would expect to be able to offset against the total time using the count figure, which doesn't work either, mostly because the figure is almost never available - i.e. there is only a separate list entry (for every parent thread) which does not represent a specific process scope.
So, I read the QNX community wiki about the "Application Profiler", including the manual about "New IDE Application Profiler Enhancements", as well as the official manual articles about how to use the profiler tool... but I couldn't figure out how I would use the tool to serve my interest.
Bottom line: I'm pretty sure I'm misinterpreting and misusing the tool for what it was intended to be used. Thus my question - how would I interpret the numbers or use the tool's feedback properly to get my 2ms cycle time confirmed?
Additional information
CPU: single core
QNX SDP 6.5 / Momentics 4.7.0
Profiling Method: Sampling and Call Count Instrumentation
Profiling Scope: Single Application
I enabled "Build for Profiling (Sampling and Call Count Instrumentation)" in the Build Options1
The System Profiler should give you what you are looking for. It hooks into the microkernel and lets you see the state of all threads on the system. I used it in a similar setup to find out why our system was getting unexpected time-outs. (The cause turned out to be page waits on critical threads.)

How to test Interrupt Latency?

Windows Embedded Compact 7.
Is there a way to test interrupt latency time from user space?
Are there any tools provided as part of platform builder?
I also saw a program called Intrtime.exe - but no examples on how to use it.
How does one test the interrupt latency time?
Reference for Intrtime.exe but how do I implement it?
http://www.ece.ufrgs.br/~cpereira/temporeal_pos/www/WindowsCE2RT.htm
EDIT
Also found:
ILTiming.exe Real-Time Measurement Tool (Compact 2013)
http://msdn.microsoft.com/en-us/library/ee483144.aspx
This really is a test that requires hardware, and there are a couple of "latencies" you might measure. One is the time from the interrupt signal to when the driver ISR reacts, and the second is from when the interrupt occurs to when an IST reacts.
I did this back in the CE 3.0/CE 4.0 days by attaching a signal generator to an interruptible input and then having an ISR pulse a second input and an IST pulse a third input when they received the interrupt. I hooked a scope up to the input and outputs and used it to measure the time between the input signal and the output signals, to get not just latency but also jitter. You could easily add a 4th line for CE 7 so you could check an IST in user space and an IST in kernel space. I'd definitely be interested to see the results.
I don't think you can effectively measure this with software running on the platform, as you get into the problem of the code trying to do the measurement affecting the results. You're also talking time way, way below the system tick resolution so the scheduler is going to be problematic as well. CeLog might be able to get you an idea on these times, but getting it set up and running is probably more work than just hooking up a scope.
What is usually meant by interrupt latency is the time between an interrupt source asserting the interrupt line and a thread (sometimes in user-space) being scheduled and then executing as a result.
Unless your CPU has some accurate way of time-stamping interrupt events as they arrive at the CPU (rather than when an ISR runs), the only truly accurate measurement is one done externally - by measuring the time between the interrupt line being asserted and some observable signal that the thread responding to the interrupt can control. A DSO or logic analyser is usually used for this purpose.
Software techniques usually rely on storing an accurate time-stamp at the earliest opportunity in an ISR. If you're certain the time between interrupt line becoming asserted and the ISR running is negligible, this might be valid. If, on the other hand, disabling of interrupts is being used to control concurrency, or interrupts are nested, you probably want to be measuring this as well.
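As a sketch of that software technique (every name below - read_hires_counter(), signal_ist(), wait_for_interrupt_event() - is hypothetical, not a real Windows CE API; the point is only where the timestamps are taken):

// Hypothetical platform hooks - not a real Windows CE API:
extern unsigned long read_hires_counter(void);   // free-running hardware counter
extern void signal_ist(void);                    // wake the interrupt service thread
extern void wait_for_interrupt_event(void);      // IST blocks here until signaled

static volatile unsigned long g_isr_timestamp;

void my_isr(void)                                // runs when the interrupt asserts
{
    g_isr_timestamp = read_hires_counter();      // stamp as early as possible
    signal_ist();
}

void my_ist(void)                                // interrupt service thread
{
    for (;;) {
        wait_for_interrupt_event();
        unsigned long latency_ticks = read_hires_counter() - g_isr_timestamp;
        // convert ticks to time using the counter frequency and record it
    }
}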