C++ High frequency loop/thread timing


For a bit of context, I am writing a simple CPU emulator. The emulator process boils down to calling a 'step' function to read and execute the next operation in the program. Currently this is just done as fast as possible in a while loop.
I would like the code to be cross-platform but (unfortunately) windows is the primary target.
I need to be able to execute my Emulator->step() function at regular intervals in the range of 1,000Hz to 100,000Hz.
For a slower loop I would simply use sleep() but (on windows at least) it doesn't have the resolution for such a high frequency.
I have also toyed with spinning a loop checking a Boost microsecond timer. Ignoring the inaccuracy of this method, it uses up real CPU time whilst it is meant to be 'idle'. I am running several emulated CPUs concurrently in threads so the while loop causes a noticeable impact on performance.
Surely there is a method of doing what I want to do with C++?

You can't sleep precisely under Windows (maybe the Windows performance counter functions help, see Is there a Windows equivalent of nanosleep?).
You said that you run many simulated CPUs concurrently in threads, so one possible solution is to throw the threads away and do the scheduling for the different CPUs yourself (round robin).
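A minimal sketch of that round-robin idea, assuming each emulated CPU only exposes a step() function as in the question. The Emulator struct and run_round_robin are hypothetical names used for illustration, not the asker's code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the asker's emulator; only step() is assumed.
struct Emulator {
    std::uint64_t steps_executed = 0;
    void step() { ++steps_executed; }  // real code would execute one instruction here
};

// Runs `rounds` round-robin passes on a single thread:
// in each pass every emulated CPU executes exactly one step.
void run_round_robin(std::vector<Emulator>& cpus, std::size_t rounds) {
    assert(!cpus.empty());
    for (std::size_t r = 0; r < rounds; ++r) {
        for (Emulator& cpu : cpus) {
            cpu.step();
        }
    }
}
```

With this layout only one thread has to be paced, so whatever timing method is chosen is paid for once rather than once per emulated CPU.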

You don't need any special sleep resolution. At the end of each loop, just compute whether you need to sleep or not. If not, run the next loop. If so, sleep for the calculated amount. It won't matter if you sleep a little extra on one loop because this logic will make you sleep less on the next loop.
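The logic described above could look roughly like this with std::chrono (C++11). run_at_rate is a hypothetical helper name, and the step callback stands in for Emulator->step(); treat it as a sketch rather than a tuned implementation:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

using Clock = std::chrono::steady_clock;

// Calls step() `count` times at a target rate of `hz`.
// Sleeps only when the loop is ahead of the ideal timeline.
// Returns the total elapsed wall-clock time.
template <typename Step>
Clock::duration run_at_rate(Step&& step, std::uint64_t count, double hz) {
    assert(hz > 0.0);
    const auto period = std::chrono::duration_cast<Clock::duration>(
        std::chrono::duration<double>(1.0 / hz));
    const auto start = Clock::now();
    auto next = start;
    for (std::uint64_t i = 0; i < count; ++i) {
        step();
        next += period;  // deadline of this step on the ideal timeline
        if (Clock::now() < next) {
            std::this_thread::sleep_until(next);  // ahead of schedule: sleep the difference
        }
        // Behind schedule (e.g. the last sleep overshot): run the next step immediately.
    }
    return Clock::now() - start;
}
```

Because the deadline advances by a fixed period instead of being measured from the moment the sleep returns, an oversleep on one iteration is made up by skipping sleeps on the following ones. The average rate stays on target, although individual steps will arrive in bursts when the sleep granularity is coarser than the period.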

If you are using a C++11 compiler, have a look at <chrono>. There you can find high precision timers. But be aware that in a Windows environment these still have low accuracy, which Microsoft will hopefully fix in the next release.

Related

C++11 Most accurate way to pause execution for a certain amount of time? [duplicate]

I'm currently working on some C++ code that reads from a video file, parses the video/audio streams into its constituent units (such as an FLV tag) and sends it back out in order to "restream" it.
Because my input comes from file but I want to simulate a proper framerate when restreaming this data, I am considering the ways that I might sleep the thread that is performing the read on the file in order to attempt to extract a frame at the given rate that one might expect out of typical 30 or 60 FPS.
One solution is to use an obvious std::this_thread::sleep_for call and pass in the amount of milliseconds depending on what my FPS is. Another solution I'm considering is using a condition variable, and using std::condition_variable::wait_for with the same idea.
I'm a little stuck, because I know that the first solution doesn't guarantee exact precision -- the sleep will last around as long as the argument I pass in but may in theory be longer. And I know that the std::condition_variable::wait_for call will require lock reacquisition which will take some time too. Is there a better solution than what I'm considering? Otherwise, what's the best methodology to attempt to pause execution for as precise a granularity as possible?

C++11 Most accurate way to pause execution for a certain amount of time?
This:
auto start = now();
while(now() < start + wait_for);
now() is a placeholder for the most accurate time measuring method available for the system.
This is the analogue to sleep as what spinlock is to a mutex. Like a spinlock, it will consume all the CPU cycles while pausing, but it is what you asked for: The most accurate way to pause execution. There is trade-off between accuracy and CPU-usage-efficiency: You must choose which is more important to have for your program.
why is it more accurate than std::this_thread::sleep_for?
Because sleep_for yields execution of the thread. As a consequence, it can never have better granularity than the process scheduler of the operating system has (assuming there are other processes competing for time).
The live loop shown above which doesn't voluntarily give up its time slice will achieve the highest granularity provided by the clock that is used for measurement.
Of course, the time slice granted by the scheduler will eventually run out, and that might happen near the time we should be resuming. Only way to reduce that effect is to increase the priority of our thread. There is no standard way of affecting the priority of a thread in C++. The only way to get completely rid of that effect is to run on a non-multi-tasking system.
On multi-CPU systems, one trick that you might want to do is to set the thread affinity so that the OS thread won't be migrated to other hardware threads, which would introduce latency. Likewise, you might want to set the thread affinity of your other threads to stay off the time measuring thread. There is no standard tool to set thread affinity.
Let T be the time you wish to sleep for and let G be the maximum time that sleep_for could possibly overshoot.
If T is greater than G, then it will be more efficient to use sleep_for for T - G time units, and only use the live loop for the final G - O time units (where O is the time that sleep_for was observed to overshoot).
Figuring out what G is for your target system can be quite tricky however. There is no standard tool for that. If you over-estimate, you'll waste more cycles than necessary. If you under-estimate, your sleep may overshoot the target.
In case you're wondering what is a good choice for now(), the most appropriate tool provided by the standard library is std::chrono::steady_clock. However, that is not necessarily the most accurate tool available on your system. What tool is the most accurate depends on what system you're targeting.
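A sketch of the combined approach (sleep for roughly T - G, then spin for the rest), using std::chrono::steady_clock as now(). precise_wait_until is a hypothetical name, and max_overshoot plays the role of G, which has to be estimated for the target system:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Waits until `deadline`. Sleeps while more than `max_overshoot` (G) remains,
// then busy-waits for whatever is left after the sleep returns.
void precise_wait_until(Clock::time_point deadline, Clock::duration max_overshoot) {
    assert(max_overshoot >= Clock::duration::zero());
    const auto now = Clock::now();
    if (deadline - now > max_overshoot) {
        // Yield the CPU for T - G; the sleep may overshoot by up to G.
        std::this_thread::sleep_for(deadline - now - max_overshoot);
    }
    while (Clock::now() < deadline) {
        // Live loop: burns CPU, but resumes as close to the deadline as the clock allows.
    }
}
```

If max_overshoot is set too small, the sleep can return after the deadline and the spin loop exits immediately, so the wait overshoots exactly as a plain sleep_for would.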

Analyzing spikes in performance measurement

I have a set of C++ functions which does some image processing related operation. Generally I see that the final output is delivered in 5-6ms time range. I am measuring the time taken using QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the performance spikes up to 20ms for some images. My question is how do I go about analyzing such issues. Basically I want to determine whether the spikes are caused due to some delay in this code or whether some other task started running inside the CPU because of which this operation took time. I have tried using GetThreadTimes API to see how much time my thread spent inside CPU but am unable to conclude based on those numbers. What is the standard way to go about troubleshooting these types of issues?

The reason behind sudden spikes during processing could be any of IO, interrupts, scheduled processes etc.
It is very common to see such spikes in operations with such low latency/processing time. IMO you can attribute them to any of the reasons mentioned above (there could be more). The simplest solution is to run the same experiment with more inputs multiple times and take the average for final consideration.
To answer your question about checking/confirming the source of the spikes, you can try the following:
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check if any resource is choking (% util is simplest way to check and SAR/NMON utility on linux is best with minimal overhead)
Reserve few CPU's on system (CPU Affinity) for your experiment which are dedicated only for your program and no OS task will run on them. Taskset is simplest utility to try out. More details are here.
Run the experiment with this setting and check behavior.

That's a nasty thing you are trying to figure out. I wouldn't even attempt to, since coming to concrete conclusions is hard.
In general, one should run a loop of many iterations (100 just seems too small I think), and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt performance of your program.
A typical way to check if "some other task started running inside the CPU" would be to run your program once and mark the images that produce that spike. Example, image 2, 4, 5, and 67 take too long to be processed. Run your program again some times, and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
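The comparison between runs can be automated. A small sketch, assuming the per-image timings of each run have already been collected into a vector of milliseconds; the function names and the threshold are made up for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Returns the indices of the images whose processing time exceeds `threshold_ms`.
std::set<std::size_t> spike_indices(const std::vector<double>& times_ms, double threshold_ms) {
    std::set<std::size_t> spikes;
    for (std::size_t i = 0; i < times_ms.size(); ++i) {
        if (times_ms[i] > threshold_ms) spikes.insert(i);
    }
    return spikes;
}

// Returns the images that spiked in both runs.
std::set<std::size_t> repeated_spikes(const std::vector<double>& run_a,
                                      const std::vector<double>& run_b,
                                      double threshold_ms) {
    assert(run_a.size() == run_b.size());
    const auto a = spike_indices(run_a, threshold_ms);
    const auto b = spike_indices(run_b, threshold_ms);
    std::set<std::size_t> both;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(both, both.end()));
    return both;
}
```

An empty or near-empty result over several runs suggests the spikes move around and are more likely caused by something outside the code; a stable set of indices points at the images themselves.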

What is the standard way to go about troubleshooting these types of issues?
There are Real Time Operating Systems (RTOS) which guarantee those kinds of delays. They are a totally different class of operating systems than Windows or Linux.
But still, there are some things you can do about your delays even on a general purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to a disk -- there are no guarantees whatsoever about delays. So, avoid any system functions on your critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code.
If your code base is big, use tools like strace on Linux or Dr Memory on Windows to trace system calls.
2. Avoid context switches
Multithreading on Windows is preemptive. It means there is a system scheduler, which might stop your thread at any time and schedule another thread on your CPU. As noted above, there are RTOSes which let you avoid such context switches, but there is something you can do about it:
make sure there is at least one CPU core left for system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints system scheduler to avoid scheduling other threads on this CPU;
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread with CPU 1+;
increase your thread priority, so the scheduler is less likely to switch your thread with another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm there are context switches.
3. Avoid other non-deterministic features
Few more sources of unexpected delays:
disable swap, so you know for sure your thread memory will not be swapped out to a slow and unpredictable disk drive;
disable CPU turbo boost -- after a high-performance CPU boosts, there is always a slowdown so that the CPU stays within its thermal design power (TDP);
disable hyper threading -- from the scheduler's point of view those are independent CPUs, but in fact the performance of each hyper-thread depends on what the other thread is doing at the moment.
Hope this helps.

Sleeping for an exact duration

My understanding of the Sleep function is that it follows "at least semantics" i.e. sleep(5) will guarantee that the thread sleeps for 5 seconds, but it may remain blocked for more than 5 seconds depending on other factors. Is there a way to sleep for exactly a specified time period (without busy waiting).

As others have said, you really need to use a real-time OS to try and achieve this. Precise software timing is quite tricky.
However... although not perfect, you can get a LOT better results than "normal" by simply boosting the priority of the process that needs better timing. In Windows you can achieve this with the SetPriorityClass function. If you set the priority to the highest level (REALTIME_PRIORITY_CLASS: 0x00000100) you'll get much better timing results. Again - this will not be perfect like you are asking for, though.
This is also likely possible on other platforms than Windows, but I've never had reason to do it so haven't tested it.
EDIT: As per the comment by Andy T, if your app is multi-threaded you also need to watch out for the priority assigned to the threads. For Windows this is documented here.
Some background...
A while back I used SetPriorityClass to boost the priority on an application where I was doing real-time analysis of high-speed video and I could NOT miss a frame. Frames were arriving to the pc at a very regular (driven by external framegrabber HW) frequency of 300 frames per second (fps), which fired a HW interrupt on every frame which I then serviced. Since timing was very important, I collected a lot of stats on the interrupt timing (using QueryPerformanceCounter stuff) to see how bad the situation really was, and was appalled at the resulting distributions. I don't have the stats handy, but basically Windows was servicing the interrupt whenever it felt like it when run at normal priority. The histograms were very messy, with the stdev being wider than my ~3ms period. Frequently I would have gigantic gaps of 200 ms or greater in the interrupt servicing (recall that the interrupt fired roughly every 3 ms)!! ie: HW interrupts are FAR from exact! You're stuck with what the OS decides to do for you.
However - when I discovered the REALTIME_PRIORITY_CLASS setting and benchmarked with that priority, it was significantly better and the service interval distribution was extremely tight. I could run 10 minutes of 300 fps and not miss a single frame. Measured interrupt servicing periods were pretty much exactly 1/300 s with a tight distribution.
Also - try and minimize the other things the OS is doing to help improve the odds of your timing working better in the app where it matters. eg: no background video transcoding or disk de-fragging or anything while you're trying to get precision timing with other code!!
In summary:
If you really need this, go with a real time OS
If you can't use a real-time OS (impossible or impractical), boosting your process priority will likely improve your timing by a lot, as it did for me
HW interrupts won't do it... the OS still needs to decide to service them!
Make sure that you don't have a lot of other processes running that are competing for OS attention
If timing is really important to you, do some testing. Although getting code to run exactly when you want it to is not very easy, measuring this deviation is quite easy. The high performance counters in PCs (what you get with QueryPerformanceCounter) are extremely good.
Since it may be helpful (although a bit off topic), here's a small class I wrote a long time ago for using the high performance counters on a Windows machine. It may be useful for your testing:
CHiResTimer.h
#pragma once
#include "stdafx.h"
#include <windows.h>

class CHiResTimer
{
private:
    LARGE_INTEGER frequency;
    LARGE_INTEGER startCounts;
    double ConvertCountsToSeconds(LONGLONG Counts);
public:
    CHiResTimer(); // constructor
    void ResetTimer(void);
    double GetElapsedTime_s(void);
};

CHiResTimer.cpp
#include "stdafx.h"
#include "CHiResTimer.h"

double CHiResTimer::ConvertCountsToSeconds(LONGLONG Counts)
{
    return ((double)Counts / (double)frequency.QuadPart);
}

CHiResTimer::CHiResTimer()
{
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&startCounts); // starts the timer right away
}

void CHiResTimer::ResetTimer()
{
    QueryPerformanceCounter(&startCounts); // reset the reference counter
}

double CHiResTimer::GetElapsedTime_s()
{
    LARGE_INTEGER countsNow;
    QueryPerformanceCounter(&countsNow);
    return ConvertCountsToSeconds(countsNow.QuadPart - startCounts.QuadPart);
}
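For code that also has to build outside Windows, a rough portable counterpart of the class above can be written with std::chrono::steady_clock (C++11). ChronoTimer is a hypothetical name, and the resolution you actually get depends on how the standard library implements steady_clock on your platform, so measure before relying on it:

```cpp
#include <cassert>
#include <chrono>

// Portable sketch of the same interface, built on std::chrono::steady_clock.
class ChronoTimer {
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_;

public:
    ChronoTimer() : start_(Clock::now()) {}  // starts the timer right away

    void reset() { start_ = Clock::now(); }  // reset the reference point

    // Seconds elapsed since construction or the last reset().
    double elapsed_s() const {
        const std::chrono::duration<double> elapsed = Clock::now() - start_;
        assert(elapsed.count() >= 0.0);  // steady_clock never goes backwards
        return elapsed.count();
    }
};
```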

No.
The reason it's "at least semantics" is that after those 5 seconds some other thread may be busy.
Every thread gets a time slice from the Operating System. The Operating System controls the order in which the threads are run.
When you put a thread to sleep, the OS puts the thread in a waiting list, and when the timer is over the operating system "Wakes" the thread.
This means that the thread is added back to the active threads list, but it isn't guaranteed that it will be added in first place. (What if 100 threads need to be awakened in that specific second? Who will go first?)

While standard Linux is not a realtime operating system, the kernel developers pay close attention to how long a high priority process would remain starved while kernel locks are held. Thus, a stock Linux kernel is usually good enough for many soft-realtime applications.
You can schedule your process as a realtime task with the sched_setscheduler(2) call, using either SCHED_FIFO or SCHED_RR. The two have slight differences in semantics, but it may be enough to know that a SCHED_RR task will eventually relinquish the processor to another task of the same priority due to time slices, while a SCHED_FIFO task will only relinquish the CPU to another task of the same priority due to blocking I/O or an explicit call to sched_yield(2).
Be careful when using realtime scheduled tasks; as they always take priority over standard tasks, you can easily find yourself coding an infinite loop that never relinquishes the CPU and blocks admins from using ssh to kill the process. So it might not hurt to run an sshd at a higher realtime priority, at least until you're sure you've fixed the worst bugs.
There are variants of Linux available that have been worked on to provide hard-realtime guarantees. RTLinux has commercial support; Xenomai and RTAI are competing implementations of realtime extensions for Linux, but I know nothing else about them.

As previous answerers said: there is no way to be exact (some suggested realtime-os or hardware interrupts and even those are not exact). I think what you are looking for is something that is just more precise than the sleep() function and you find that depending on your OS in e.g. the Windows Sleep() function or under GNU the nanosleep() function.
http://msdn.microsoft.com/en-us/library/ms686298%28VS.85%29.aspx
http://www.delorie.com/gnu/docs/glibc/libc_445.html
Both will give you precision within a few milliseconds.

Well, you try to tackle a difficult problem, and achieving exact timing is not feasible: the best you can do is to use hardware interrupts, and the implementation will depend on both your underlying hardware, and your operating system (namely, you will need a real-time operating system, which most regular desktop OS are not). What is your exact target platform?

No. Because you're always depending on the OS to handle waking up threads at the right time.

There is no way to sleep for a specified time period using standard C. You will need, at minimum, a 3rd party library which provides greater granularity, and you might also need a special operating system kernel such as the real-time Linux kernels.
For instance, here is a discussion of how close you can come on Win32 systems.
This is not a C question.

Sleep a thread for 100.8564 milliseconds in C++ on the Windows platform

Is there any method to sleep a thread for up to 100.8564 milliseconds under Windows? I am using a multimedia timer but its resolution is a minimum of 1 second. Kindly guide me so that I can handle the fractional part of the millisecond.

Yes you can do it. See QueryPerformanceCounter() to read accurate time, and make a busy loop.
This will enable you to make waits with up to 10 nanosecond resolution, however, if thread scheduler decides to steal control from you at the moment of the cycle end, it will, and there's nothing you can do about it except assigning your process realtime priority.
You may also have a look at this: http://msdn.microsoft.com/en-us/library/ms838340(WinEmbedded.5).aspx
Several frameworks were developed to do hard realtime on windows.
Otherwise, your question probably implies that you might be doing something wrong. There're numerous mechanisms to trick around ever needing precise delays, such as using proper bus drivers (in case of hardware/IO, or respective DMAs if you are designing a driver), and more.
Please tell us what exactly are you building.

I do not know your use case, but even a high end realtime operating system would be hard pressed to achieve less than 100ns jitter on timings.
In most cases I found you do not need that precision in reproducibility but only for long time drift. In that respect it is relatively straightforward to keep a timeline and calculate the event on the desired precision. Then use that timeline to synchronize the events which may be off even by 10's of ms. As long as these errors do not add up, I found I got adequate performance.

If you need guaranteed latency, you cannot get it with MS Windows. It's not a realtime operating system. It might swap in another thread or process at an inopportune instant. You might get a cache miss. When I did a robot controller a while back, I used an OS called On Time RTOS 32. It has an MS Windows API emulation layer. You can use it with Visual Studio. You'll need something like that.

The resolution of a multimedia timer is much better than one second. It can go down to 1 millisecond when you call timeBeginPeriod(1) first. The timer will automatically adjust its interval for the next call when the callback is delivered late. Which is inevitable on a multi-tasking operating system, there is always some kind of kernel thread with a higher priority than yours that will delay the callback.
While it will work pretty well on average, worst case latency is in the order of hundreds of milliseconds. Clearly, your requirements cannot be met by Windows by a long shot. You'll need some kind of microcontroller to supply that kind of execution guarantee.

CPU throttling in C++

I was just wondering if there is an elegant way to set the maximum CPU load for a particular thread doing intensive calculations.
Right now I have located the most time consuming loop in the thread (it does only compression) and use GetTickCount() and Sleep() with hardcoded values. It makes sure that the loop continues for a certain period and then sleeps for a certain minimum time. It more or less does the job, i.e. guarantees that the thread will not use more than 50% of CPU. However, behavior is dependent on the number of CPU cores (huge disadvantage) and simply ugly (smaller disadvantage :)). Any ideas?

I am not aware of any API to get the OS's scheduler to do what you want (even if your thread is idle-priority, if there are no higher-priority ready threads, yours will run). However, I think you can improvise a fairly elegant throttling function based on what you are already doing. Essentially (I don't have a Windows dev machine handy):
Pick a default amount of time the thread will sleep each iteration. Then, on each iteration (or on every nth iteration, such that the throttling function doesn't itself become a significant CPU load),
Compute the amount of CPU time your thread used since the last time your throttling function was called (I'll call this dCPU). You can use the GetThreadTimes() API to get the amount of time your thread has been executing.
Compute the amount of real time elapsed since the last time your throttling function was called (I'll call this dClock).
dCPU / dClock is the percent CPU usage (of one CPU). If it is higher than you want, increase your sleep time, if lower, decrease the sleep time.
Have your thread sleep for the computed time.
Depending on how your watchdog computes CPU usage, you might want to use GetProcessAffinityMask() to find out how many CPUs the system has. dCPU / (dClock * CPUs) is the percentage of total CPU time available.
You will still have to pick some magic numbers for the initial sleep time and the increment/decrement amount, but I think this algorithm could be tuned to keep a thread running at fairly close to a determined percent of CPU.
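The adjustment step of that algorithm can be sketched as a pure function. This assumes dCPU and dClock have already been measured (on Windows dCPU would come from GetThreadTimes(), which is not shown here); adjust_sleep, the step size and the upper bound are hypothetical values chosen for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Returns the sleep time to use for the next iteration, given the CPU time used
// (d_cpu) and the wall-clock time elapsed (d_clock) since the previous call.
Millis adjust_sleep(Millis current_sleep, Millis d_cpu, Millis d_clock,
                    double target_usage, Millis step, Millis max_sleep) {
    assert(d_clock.count() > 0);
    const double usage =
        static_cast<double>(d_cpu.count()) / static_cast<double>(d_clock.count());
    if (usage > target_usage) {
        return std::min(current_sleep + step, max_sleep);  // too busy: sleep longer
    }
    if (usage < target_usage) {
        return std::max(current_sleep - step, Millis(0));  // too idle: sleep less
    }
    return current_sleep;
}
```

For example, with a 50% target, 80 ms of CPU time over 100 ms of wall-clock time raises a 10 ms sleep to 12 ms when the step is 2 ms. The thread would then call std::this_thread::sleep_for (or Sleep()) with the returned value.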

On linux, you can change the scheduling priority of a thread with nice().

I can't think of any cross platform way of what you want (or any guaranteed way full stop) but as you are using GetTickCount perhaps you aren't interested in cross platform :)
I'd use interprocess communications and set the intensive processes nice levels to get what you require but I'm not sure that's appropriate for your situation.
EDIT:
I agree with Bernard which is why I think a process rather than a thread might be more appropriate but it just might not suit your purposes.

The problem is it's not normal to want to leave the CPU idle while you have work to do. Normally you set a background task to IDLE priority, and let the OS handle scheduling it all the CPU time that isn't used by interactive tasks.
It sounds to me like the problem is the watchdog process.
If your background task is CPU-bound then you want it to take all the unused CPU time for its task.
Maybe you should look at fixing the watchdog program?

You may be able to change the priority of a thread, but changing the maximum utilization would either require polling and hacks to limit how many things are occurring, or using OS tools that can set the maximum utilization of a process.
However, I don't see any circumstance where you would want to do this.
