Sub-millisecond precision timing in C or C++

What techniques / methods exist for getting sub-millisecond precision timing data in C or C++, and what precision and accuracy do they provide? I'm looking for methods that don't require additional hardware. The application involves waiting for approximately 50 microseconds +/- 1 microsecond while some external hardware collects data.
EDIT: The OS is Windows, probably with VS2010. If I can get drivers and SDKs for the hardware on Linux, I can go there using the latest GCC.

When dealing with off-the-shelf operating systems, accurate timing is an extremely difficult and involved task. If you really need guaranteed timing, the only real option is a full real-time operating system. However, if "almost always" is good enough, here are a few tricks that will provide good accuracy under commodity Windows and Linux.
1. Use a shielded CPU. Basically, this means routing IRQs away from a selected CPU and setting the processor affinity mask for every other process on the machine to exclude your targeted CPU. In your app, set the CPU affinity to run only on the shielded CPU. Effectively, this should prevent the OS from ever suspending your app, as it will always be the only runnable process for that CPU.
2. Never let your process willingly yield control to the OS (which is inherently non-deterministic for non-realtime OSes). No memory allocation, no sockets, no mutexes, nada. Use RDTSC to spin in a while loop waiting for your target time to arrive. It'll consume 100% of a CPU, but it's the most accurate way to go.
3. If number 2 is a bit too draconian, you can 'sleep short' and then burn the CPU up to your target time (a rough sketch follows after this list). Here, you take advantage of the fact that the OS schedules the CPU at set intervals, usually 100 or 1000 times per second depending on your OS and configuration (on Windows you can raise the default scheduling rate from 100/s to 1000/s using the multimedia API). This can be a little hard to get right, but essentially you need to determine when the OS scheduling periods occur and calculate the one prior to your target wake time. Sleep for that duration and then, upon waking, spin on RDTSC (if you're on a single CPU... use QueryPerformanceCounter or the Linux equivalent if not) until your target time arrives. Occasionally, OS scheduling will cause you to miss, but generally speaking this mechanism works pretty well.
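A minimal sketch of that third option on Windows (my own illustration; the wait_until_us helper, its signature, and the ~2 ms margin are assumptions, not part of the answer itself):
#include <windows.h>
#pragma comment(lib, "winmm.lib")   // timeBeginPeriod / timeEndPeriod live in winmm

// Sleep most of the way at ~1 ms scheduler granularity, then spin on the
// performance counter for the final stretch. 'base' is a counter value you
// captured earlier; 'target_us' is the offset from it you want to wait for.
void wait_until_us(LARGE_INTEGER base, LARGE_INTEGER freq, long long target_us)
{
    timeBeginPeriod(1);                              // raise scheduler rate to ~1000/s
    for (;;) {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        long long elapsed_us =
            (now.QuadPart - base.QuadPart) * 1000000LL / freq.QuadPart;
        long long remaining_us = target_us - elapsed_us;
        if (remaining_us <= 0)
            break;                                   // target reached
        if (remaining_us > 2000)
            Sleep((DWORD)(remaining_us / 1000 - 1)); // sleep, leaving ~1 ms to spin
        // else: fall through and keep spinning until the target arrives
    }
    timeEndPeriod(1);
}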
It seems like a simple question, but attaining 'good' timing gets exponentially more difficult the tighter your timing constraints are. Good luck!

The hardware (and therefore resolution) varies from machine to machine. On Windows, specifically (I'm not sure about other platforms), you can use QueryPerformanceCounter and QueryPerformanceFrequency, but be aware you should call both from the same thread and there are no strict guarantees about resolution (QueryPerformanceFrequency is allowed to return 0 meaning no high resolution timer is available). However, on most modern desktops, there should be one accurate to microseconds.
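A small sketch of my own showing the pattern, including the check for a missing high-resolution counter mentioned above:
#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, start, finish;
    if (!QueryPerformanceFrequency(&freq) || freq.QuadPart == 0) {
        std::cerr << "No high-resolution performance counter available\n";
        return 1;
    }

    QueryPerformanceCounter(&start);
    // ... code under test ...
    QueryPerformanceCounter(&finish);

    double us = (finish.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
    std::cout << "elapsed: " << us << " microseconds\n";
}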

boost::date_time has a microsecond-precision clock, but its accuracy depends on the platform.
The documentation states:
ptime microsec_clock::local_time()
"Get the local time using a sub second resolution clock. On Unix systems this is implemented using GetTimeOfDay. On most Win32 platforms it is implemented using ftime. Win32 systems often do not achieve microsecond resolution via this API. If higher resolution is critical to your application test your platform to see the achieved resolution."
http://www.boost.org/doc/libs/1_43_0/doc/html/date_time/posix_time.html#date_time.posix_time.ptime_class
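For reference, a small usage sketch (my own, subject to the resolution caveats quoted above) that times a block with microsec_clock:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

int main()
{
    using namespace boost::posix_time;

    ptime t1 = microsec_clock::local_time();
    // ... code under test ...
    ptime t2 = microsec_clock::local_time();

    time_duration d = t2 - t1;
    std::cout << "elapsed: " << d.total_microseconds() << " microseconds\n";
}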

You may try the following:
#include <sys/time.h>
struct timeval t;
gettimeofday(&t, 0);   // t.tv_sec = seconds, t.tv_usec = microseconds within the second
This gives you the current timestamp with microsecond resolution. I am not sure about the accuracy.

You could try the technique described here, but it's not portable.

Most modern processors have registers for timing or other instrumentation purposes. On x86, since the Pentium days, there is the RDTSC instruction, for example. Your compiler may give you access to this instruction via an intrinsic.
See Wikipedia for more info.
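A minimal sketch of reading the TSC through compiler intrinsics (assuming an x86 target; __rdtsc() is provided by MSVC's <intrin.h> and by GCC/Clang's <x86intrin.h>):
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#include <cstdio>

int main()
{
    unsigned long long t0 = __rdtsc();
    // ... code under test ...
    unsigned long long t1 = __rdtsc();

    // Note: the result is in CPU cycles, not time. Converting to seconds
    // requires knowing the TSC frequency, and on older CPUs the TSC may not
    // be stable across cores or frequency changes.
    std::printf("elapsed: %llu cycles\n", t1 - t0);
}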

timeval in sys/time.h has a member 'tv_usec' which is microseconds.
This link and the code below will help illustrate:
http://www.opengroup.org/onlinepubs/000095399/basedefs/sys/time.h.html
#include <iostream>
#include <cstdio>
#include <sys/time.h>

using namespace std;

int main()
{
    timeval start;
    timeval finish;
    long int sec_diff;
    long int mic_diff;   // note: can be negative on its own; combine with sec_diff for a robust difference

    gettimeofday(&start, 0);
    cout << "whooo hooo" << endl;
    gettimeofday(&finish, 0);
    sec_diff = finish.tv_sec - start.tv_sec;
    mic_diff = finish.tv_usec - start.tv_usec;
    cout << "cout-ing 'whooo hooo' took " << sec_diff << " seconds and " << mic_diff << " micros." << endl;

    gettimeofday(&start, 0);
    printf("whooo hooo\n");
    gettimeofday(&finish, 0);
    sec_diff = finish.tv_sec - start.tv_sec;
    mic_diff = finish.tv_usec - start.tv_usec;
    cout << "printf-ing 'whooo hooo' took " << sec_diff << " seconds and " << mic_diff << " micros." << endl;
}

Good luck trying to do that with MS Windows. You need a realtime operating system, that is to say, one where timing is guaranteed repeatable. Windows can switch to another thread or even another process at an inopportune moment. You will also have no control over cache misses.
When I was doing realtime robotic control, I used a very lightweight OS called OnTime RTOS32, which has a partial Windows API emulation layer. I do not know if it would be suitable for what you are doing. However, with Windows, you will probably never be able to prove that it will never fail to give a timely response.

A combination of GetSystemTimeAsFileTime and QueryPerformanceCounter can yield a reliable suite of code for obtaining microsecond-resolution time services on Windows.
See this comment in another thread here.
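The linked comment has the details; a rough sketch of the idea (my own simplification, not the code from that thread) is to pair one FILETIME reading with one performance-counter reading, then extrapolate absolute time from counter deltas:
#include <windows.h>

// Calibration pair captured once at startup (FILETIME is in 100 ns units).
struct Calibration {
    ULARGE_INTEGER baseFileTime;
    LARGE_INTEGER  baseCounts;
    LARGE_INTEGER  freq;
};

Calibration calibrate()
{
    Calibration c;
    FILETIME ft;
    QueryPerformanceFrequency(&c.freq);
    GetSystemTimeAsFileTime(&ft);            // absolute, but coarse
    QueryPerformanceCounter(&c.baseCounts);  // fine-grained, but relative
    c.baseFileTime.LowPart  = ft.dwLowDateTime;
    c.baseFileTime.HighPart = ft.dwHighDateTime;
    return c;
}

// Absolute system time "now", in 100 ns units, with performance-counter resolution.
unsigned long long now_100ns(const Calibration& c)
{
    LARGE_INTEGER counts;
    QueryPerformanceCounter(&counts);
    unsigned long long delta = counts.QuadPart - c.baseCounts.QuadPart;
    unsigned long long secs  = delta / c.freq.QuadPart;
    unsigned long long rem   = delta % c.freq.QuadPart;
    return c.baseFileTime.QuadPart
         + secs * 10000000ULL
         + rem * 10000000ULL / c.freq.QuadPart;
}
A real implementation would also re-calibrate periodically, since the two clocks drift apart over time.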

Related

How to guarantee exact thread sleep interval?

Usually if I want to simulate some work or wait for an exact time interval I use condition_variable::wait_for or, at worst, std::this_thread::sleep_for. But the condition_variable documentation states that the wait_for and wait_until methods may block longer than was requested.
This function may block for longer than timeout_duration due to scheduling or resource contention delays.
How can exact wait intervals be guaranteed?
UPDATE
Can I achieve this without condition_variable?
You cannot do this.
In order to have exact guarantees like this, you need a real time operating system.
C++ does not guarantee you are on a real time operating system.
So it provides the guarantees that a typical, non-RTOS provides.
Note that there are other complications to programming on a RTOS that go far beyond the scope of this question.
In practice, when people really want fine-grained timing control (say, they are twiddling with per-frame or per-scanline buffers, or audio buffers, or whatever), one thing they do is check whether the remaining time is short, and if so spin-wait. If the time is longer, they sleep for a bit less than the amount of time they want to wait, then wake up and spin.
This is also not guaranteed to work, but works well enough for almost all cases.
On a RTOS, the platform may provide primitives like you want. These lie outside the scope of standard C++. No typical desktop OS is an RTOS that I am aware of. If you are programming for a fighter jet's control hardware or similar, you may be programming on an RTOS.
I hope you aren't writing fighter jet control software and asking this question on stack overflow.
If you did hypothetically sleep for precisely some exact duration, and then performed some action in response (such as getting the current time, or printing a message to the screen), then that action might be delayed for some unknown period of time e.g. due to processor load. This is equivalent to the action happening (almost) immediately but the timer taking longer than expected. Even in the best case scenario, where the timer completes at precisely the time you request, and the operating system allows your action to complete without preempting your process, it will take a few clock cycles to perform that action.
So in other words, on a standard operating system, it is impossible or maybe even meaningless for a timer to complete at precisely the time requested.
How can this be overcome? An academic answer is that you can use specialized software and hardware such as a real-time operating system, but developing for these is vastly more complicated than regular programming. What you probably really want to know is that, in the common case, the delay that documentation refers to is not substantial, i.e. it is commonly less than 1/100th of a second.
With a brute force loop... for example:
#include <chrono>

std::chrono::microseconds sleep_duration{1000};
auto start = std::chrono::high_resolution_clock::now();
while (true)
{
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::high_resolution_clock::now() - start);
    if (elapsed >= sleep_duration)
        break;
}
That's a bit ugly, but desktop operating systems are not real-time, so you cannot have such precision.
To relax the CPU, you can use the following snippet:
#include <chrono>
#include <thread>

void little_sleep(std::chrono::microseconds us)
{
    auto start = std::chrono::high_resolution_clock::now();
    auto end = start + us;
    do {
        std::this_thread::yield();
    } while (std::chrono::high_resolution_clock::now() < end);
}
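A middle ground between the busy loop and the yield loop above (my own sketch; the 2 ms margin is an assumption you would tune per system) is to sleep for most of the interval and only spin at the end:
#include <chrono>
#include <thread>

void precise_sleep(std::chrono::microseconds us)
{
    using clock = std::chrono::high_resolution_clock;
    auto end = clock::now() + us;
    auto margin = std::chrono::milliseconds(2);   // tune for your system

    if (us > margin)
        std::this_thread::sleep_for(us - margin); // cheap, but imprecise
    while (clock::now() < end)
        ;                                         // burn the last stretch
}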
That depends on what accuracy you can expect. Generally, as others have said, a regular OS (Linux, Windows) cannot guarantee that.
Why?
Your OS probably has a concept of threads. If so, then there is a scheduler which interrupts threads and switches execution to other threads waiting in the queue. This can spoil the accuracy of timers.
What can I do about it?
If you are using an embedded system, go for bare metal, i.e. don't use an OS at all, or use a so-called hard real-time operating system.
If you are using Linux, look for the Linux RT Preempt Patch on Google. You have to recompile your kernel to include the patch (not so complicated though), and then you can create threads with a priority above 50, which means a priority above the kernel's threads. In the end this means you can have a thread that can preempt the scheduler and the kernel in general, providing quite good time accuracy. In my case it was three orders of magnitude (from a few ms of latency to a few µs); see the sketch after this list.
If you are using Windows, I don't know of such a patch, but you can search for high-precision timers on the Microsoft site. Maybe the accuracy they provide will be enough for your needs.
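As a sketch of the RT Preempt approach (my own illustration; priority 80 and the rt_work name are arbitrary choices), you create the time-critical thread with an explicit SCHED_FIFO policy:
#include <pthread.h>
#include <sched.h>
#include <cstdio>

void* rt_work(void*)
{
    // time-critical loop goes here
    return nullptr;
}

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);

    sched_param param{};
    param.sched_priority = 80;                    // above most kernel threads on RT Preempt
    pthread_attr_setschedparam(&attr, &param);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);

    pthread_t tid;
    if (pthread_create(&tid, &attr, rt_work, nullptr) != 0)
        std::perror("pthread_create (needs root or CAP_SYS_NICE)");
    else
        pthread_join(tid, nullptr);

    pthread_attr_destroy(&attr);
    return 0;
}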

Is clock() in c++ consistent with heavy CPU loads

Right now I basically have a program that uses clock to test the amount of time my program takes to do certain operations and usually it is accurate to a couple milliseconds. My question is this: If the CPU is under heavy load will I still get the same results?
Does clock only count when the CPU is working on my process?
Let's assume: a multi-core CPU, but a process that does not take advantage of multithreading.
The behaviour of clock depends on the OS. On Windows, due to a long-distant decision, clock gives the elapsed (wall-clock) time; on most other OSes (certainly Linux, MacOS and other Unix-related OSes) it gives the CPU time used by your process.
Depending on what you actually want to achieve, elapsed time or CPU time may be what you want to measure.
In a system where other processes are running, the difference between elapsed time and CPU usage may be huge. And of course, if your CPU is NOT busy running your application (e.g. it is waiting for network packets to go down the wire or for file data from the hard disk), the elapsed time is "available" for other applications.
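To make the distinction concrete, here is a small experiment (a sketch of my own, not from the answer) that times a sleep with both std::clock() and a wall clock; on POSIX systems the two numbers diverge sharply, while on Windows they roughly agree:
#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

int main()
{
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();

    std::cout << "clock(): " << 1000.0 * (c1 - c0) / CLOCKS_PER_SEC << " ms\n";
    std::cout << "elapsed: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(w1 - w0).count()
              << " ms\n";
}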
There are also a huge number of error factors/interference factors when there are other processes running in the same system:
If we assume that your OS supports clock as a measure of CPU time, the precision here is not always that great. For example, it may well be accounted in terms of CPU-timer ticks, and your process may not run for a "full tick" if it's doing I/O.
Other processes may use "your" CPU for parts of interrupt handling, before the OS has switched to "account for this as interrupt time", when dealing with network packets or hard-disk I/O [typically not huge amounts, but on a very busy system it can be several percent of the total time]. And if other processes run on "your" CPU, the time spent reloading the cache(s) with "your" process' data after the other process loaded its data will be accounted to "your" time. This sort of "interference" may very well affect your measurements; how much depends very much on "what else" is going on in the system.
If your process shares data [via shared memory] with another process, there will also be (again, typically a minute amount, but in extreme cases, it can be significant) some time spent on dealing with "cache-snoop requests" between your process and the other process, when your process doesn't get to execute.
If the OS is switching tasks, "half" of the time spent switching to/from your task will be accounted to your process, and half to the other process being switched in/out. Again, these are usually tiny amounts, but if you have a very busy system with lots of process switches, it can add up.
Some processor features, e.g. Intel's HyperThreading, share resources within your actual core, so only SOME of the time on that core is spent in your process, and the cache content of your process is now shared with some other process' data and instructions, meaning your process MAY get "evicted" from the cache by the other thread running on the same CPU core.
Likewise, multicore CPUs often have a shared L3 cache that gets affected by other processes running on the other cores of the CPU.
File-caching and other "system caches" will also be affected by other processes - so if your process is reading some file(s), and other processes also access file(s), the cache-content will be "less yours" than if the system wasn't so busy.
For accurate measurements of how much of the system's resources your process uses, you need processor performance counters (and a reproducible test case, because you probably need to run the same setup several times to ensure that you get the "right" combination of performance counters). Of course, most of these counters are ALSO system-wide, and some kinds of processing, for example interrupts and other random interference, will affect the measurement, so the most accurate results will come when you DON'T have many other (busy) processes running in the system.
Of course, in MANY cases, just measuring the overall time of your application is perfectly adequate. Again, as long as you have a reproducible test case that gives the same (or at least similar) timing each time it's run in a particular scenario.
Each application is different, each system is different. Performance measurement is a HUGE subject, and it's very hard to cover EVERYTHING - and of course, we're not here to answer extremely specific questions about "how do I get my PI-with-a-million-decimals to run faster when there are other processes running in the same system" or whatever it may be.
In addition to agreeing with the responses indicating that timings depend on many factors, I would like to bring to your attention the std::chrono library available since C++11:
#include <chrono>
#include <iostream>

int main() {
    auto beg = std::chrono::high_resolution_clock::now();
    std::cout << "*** Displaying Some Stuff ***" << std::endl;
    auto end = std::chrono::high_resolution_clock::now();
    auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end - beg);
    std::cout << "Elapsed: " << dur.count() << " microseconds" << std::endl;
}
As per the standard, this program uses the highest-resolution clock your implementation provides, and the elapsed time is reported with microsecond granularity (other granularities are available; see the docs).
Sample run:
$ g++ example.cpp -std=c++14 -Wall -Wextra -O3
$ ./a.out
*** Displaying Some Stuff ***
Elapsed: 29 microseconds
While it is much more verbose than relying on the C-style std::clock(), I feel it gives you much more expressiveness, and you can hide the verbosity behind a nice interface (for example, see my answer to a previous post where I use std::chrono to build a function timer).
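For illustration, one way to hide that verbosity (a sketch of my own, not the code from that linked answer) is a small helper that times any callable and returns the duration:
#include <chrono>
#include <utility>

template <typename Duration = std::chrono::microseconds, typename F, typename... Args>
auto time_call(F&& f, Args&&... args)
{
    auto beg = std::chrono::high_resolution_clock::now();
    std::forward<F>(f)(std::forward<Args>(args)...);
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<Duration>(end - beg);
}

// Usage:
//   auto us = time_call([] { heavy_work(); });
//   std::cout << us.count() << " microseconds\n";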
There are shared components in a CPU, like the last-level cache and the execution units (shared between hardware threads within one core), so under heavy load you will get jitter: even if your application executes exactly the same instructions, each instruction may take more cycles (waiting on memory because data was evicted from the cache, or waiting for an available execution unit), and more cycles means more time to execute (assuming that Turbo Boost won't compensate).
If you seek a precise instrument, look at hardware performance counters.
It is also important to consider factors like the number of cores available on the physical CPU, hyper-threading, other BIOS settings like Turbo Boost on Intel CPUs, and the threading techniques used in your code when looking at timing metrics for CPU-intensive tasks.
Parallelization tools like OpenMP provide built-in functions for measuring computation and wall time, such as omp_get_wtime(), which are often more accurate than clock() in programs that make use of this type of parallelization.
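For example (a sketch of my own; compile with OpenMP enabled, e.g. -fopenmp or /openmp):
#include <omp.h>
#include <cstdio>

int main()
{
    double t0 = omp_get_wtime();

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 10000000; ++i)
        sum += 1.0 / (i + 1);                 // some parallel work

    double t1 = omp_get_wtime();
    std::printf("result %f, wall time %f s (timer tick %g s)\n",
                sum, t1 - t0, omp_get_wtick());
}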

Correct way to logging elapsed time in C++

I'm writing an article about GPU speed-up in a cluster environment.
To do that, I'm programming in CUDA, which is basically a C++ extension.
But, as I'm a C# developer, I don't know the particularities of C++.
Are there any concerns about logging elapsed time? Any suggestions or blogs to read?
My initial idea is to make a big loop and run the program several times, 50 ~ 100, and log every elapsed time, to afterwards make some graphs of the speed.
Depending on your needs, it can be as easy as:
time_t start = time(NULL);
// long running process
printf("time elapsed: %ld\n", (long)(time(NULL) - start));
I guess you need to say how you plan for this to be logged (file or console) and what precision you need (seconds, ms, us, etc.). time gives it in seconds.
I would recommend using the Boost Timer library. It is platform agnostic, and is as simple as:
#include <boost/timer.hpp>   // legacy boost::timer, which provides restart()/elapsed()
#include <iostream>

boost::timer t;
// do some stuff, up until when you want to start timing
t.restart();
// do the stuff you want to time.
std::cout << t.elapsed() << std::endl;
Of course t.elapsed() returns a double that you can save to a variable.
Standard functions such as time often have a very low resolution. And yes, a good way to get around this is to run your test many times and take an average. Note that the first few times may be extra-slow because of hidden start-up costs - especially when using complex resources like GPUs.
For platform-specific calls, take a look at QueryPerformanceCounter on Windows and CFAbsoluteTimeGetCurrent on OS X. (I've not used POSIX call clock_gettime but that might be worth checking out.)
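A quick sketch of the POSIX clock_gettime route (my own example; CLOCK_MONOTONIC is unaffected by wall-clock adjustments, and very old glibc needs -lrt):
#include <time.h>
#include <cstdio>

int main()
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // ... code under test ...
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    std::printf("elapsed: %.3f microseconds\n", us);
}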
Measuring GPU performance is tricky because GPUs are remote processing units running separate instructions - often on many parallel units. You might want to visit Nvidia's CUDA Zone for a variety of resources and tools to help measure and optimize CUDA code. (Resources related to OpenCL are also highly relevant.)
Ultimately, you want to see how fast your results make it to the screen, right? For that reason, a call to time might well suffice for your needs.

Sleeping for an exact duration

My understanding of the Sleep function is that it follows "at least semantics", i.e. sleep(5) will guarantee that the thread sleeps for 5 seconds, but it may remain blocked for more than 5 seconds depending on other factors. Is there a way to sleep for exactly a specified time period (without busy waiting)?
As others have said, you really need to use a real-time OS to try and achieve this. Precise software timing is quite tricky.
However... although not perfect, you can get a LOT better results than "normal" by simply boosting the priority of the process that needs better timing. In Windows you can achieve this with the SetPriorityClass function. If you set the priority to the highest level (REALTIME_PRIORITY_CLASS: 0x00000100) you'll get much better timing results. Again - this will not be perfect like you are asking for, though.
This is also likely possible on other platforms than Windows, but I've never had reason to do it so haven't tested it.
EDIT: As per the comment by Andy T, if your app is multi-threaded you also need to watch out for the priority assigned to the threads. For Windows this is documented here.
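For reference, the calls involved look roughly like this (a sketch of my own; REALTIME_PRIORITY_CLASS can starve the rest of the system, so use it carefully):
#include <windows.h>
#include <iostream>

int main()
{
    // Raise the whole process, then the current thread.
    if (!SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS))
        std::cerr << "SetPriorityClass failed: " << GetLastError() << "\n";

    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL))
        std::cerr << "SetThreadPriority failed: " << GetLastError() << "\n";

    // ... timing-sensitive work ...
}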
Some background...
A while back I used SetPriorityClass to boost the priority of an application where I was doing real-time analysis of high-speed video and could NOT miss a frame. Frames were arriving at the PC at a very regular frequency of 300 frames per second (fps), driven by external framegrabber HW, which fired a HW interrupt on every frame that I then serviced. Since timing was very important, I collected a lot of stats on the interrupt timing (using QueryPerformanceCounter stuff) to see how bad the situation really was, and was appalled at the resulting distributions. I don't have the stats handy, but basically Windows was servicing the interrupt whenever it felt like it when run at normal priority. The histograms were very messy, with the stdev being wider than my ~3 ms period. Frequently I would have gigantic gaps of 200 ms or greater in the interrupt servicing (recall that the interrupt fired roughly every 3 ms)!! I.e.: HW interrupts are FAR from exact! You're stuck with what the OS decides to do for you.
However - when I discovered the REALTIME_PRIORITY_CLASS setting and benchmarked with that priority, it was significantly better and the service interval distribution was extremely tight. I could run 10 minutes of 300 fps and not miss a single frame. Measured interrupt servicing periods were pretty much exactly 1/300 s with a tight distribution.
Also - try and minimize the other things the OS is doing to help improve the odds of your timing working better in the app where it matters. E.g.: no background video transcoding or disk de-fragging or anything while you're trying to get precision timing with other code!!
In summary:
If you really need this, go with a real time OS
If you can't use a real-time OS (impossible or impractical), boosting your process priority will likely improve your timing by a lot, as it did for me
HW interrupts won't do it... the OS still needs to decide to service them!
Make sure that you don't have a lot of other processes running that are competing for OS attention
If timing is really important to you, do some testing. Although getting code to run exactly when you want it to is not very easy, measuring this deviation is quite easy. The high performance counters in PCs (what you get with QueryPerformanceCounter) are extremely good.
Since it may be helpful (although a bit off topic), here's a small class I wrote a long time ago for using the high performance counters on a Windows machine. It may be useful for your testing:
CHiResTimer.h
#pragma once
#include "stdafx.h"
#include <windows.h>

class CHiResTimer
{
private:
    LARGE_INTEGER frequency;
    LARGE_INTEGER startCounts;

    double ConvertCountsToSeconds(LONGLONG Counts);

public:
    CHiResTimer(); // constructor
    void ResetTimer(void);
    double GetElapsedTime_s(void);
};
CHiResTimer.cpp
#include "stdafx.h"
#include "CHiResTimer.h"
double CHiResTimer::ConvertCountsToSeconds(LONGLONG Counts)
{
return ((double)Counts / (double)frequency.QuadPart) ;
}
CHiResTimer::CHiResTimer()
{
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&startCounts); // starts the timer right away
}
void CHiResTimer::ResetTimer()
{
QueryPerformanceCounter(&startCounts); // reset the reference counter
}
double CHiResTimer::GetElapsedTime_s()
{
LARGE_INTEGER countsNow;
QueryPerformanceCounter(&countsNow);
return ConvertCountsToSeconds(countsNow.QuadPart - startCounts.QuadPart);
}
No.
The reason it's "at least semantics" is that after those 5 seconds some other thread may be busy.
Every thread gets a time slice from the Operating System. The Operating System controls the order in which the threads are run.
When you put a thread to sleep, the OS puts the thread in a waiting list, and when the timer is over the operating system "Wakes" the thread.
This means that the thread is added back to the active threads list, but it isn't guaranteed that it will be added in first place. (What if 100 threads need to be woken in that specific second? Which one will go first?)
While standard Linux is not a realtime operating system, the kernel developers pay close attention to how long a high priority process would remain starved while kernel locks are held. Thus, a stock Linux kernel is usually good enough for many soft-realtime applications.
You can schedule your process as a realtime task with the sched_setscheduler(2) call, using either SCHED_FIFO or SCHED_RR. The two have slight differences in semantics, but it may be enough to know that a SCHED_RR task will eventually relinquish the processor to another task of the same priority due to time slices, while a SCHED_FIFO task will only relinquish the CPU to another task of the same priority due to blocking I/O or an explicit call to sched_yield(2).
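A minimal sketch of that call (priority 50 is an arbitrary choice of mine; this needs root or CAP_SYS_NICE):
#include <sched.h>
#include <cstdio>

int main()
{
    sched_param param{};
    param.sched_priority = 50;

    // Put the calling process under SCHED_FIFO.
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        std::perror("sched_setscheduler");
        return 1;
    }

    // ... realtime work; remember to block or yield periodically ...
}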
Be careful when using realtime scheduled tasks; as they always take priority over standard tasks, you can easily find yourself coding an infinite loop that never relinquishes the CPU and blocks admins from using ssh to kill the process. So it might not hurt to run an sshd at a higher realtime priority, at least until you're sure you've fixed the worst bugs.
There are variants of Linux available that have been worked on to provide hard-realtime guarantees. RTLinux has commercial support; Xenomai and RTAI are competing implementations of realtime extensions for Linux, but I know nothing else about them.
As previous answerers said: there is no way to be exact (some suggested a realtime OS or hardware interrupts, and even those are not exact). I think what you are looking for is something that is just more precise than the sleep() function, and depending on your OS you can find that in e.g. the Windows Sleep() function or, under GNU, the nanosleep() function.
http://msdn.microsoft.com/en-us/library/ms686298%28VS.85%29.aspx
http://www.delorie.com/gnu/docs/glibc/libc_445.html
Both will give you precision within a few milliseconds.
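For example, a nanosleep sketch requesting a 50-microsecond wait (the actual wait will still be at least that long, typically rounded up to the timer tick):
#include <time.h>
#include <cstdio>

int main()
{
    timespec req{};
    req.tv_sec = 0;
    req.tv_nsec = 50 * 1000;   // 50 microseconds

    if (nanosleep(&req, nullptr) != 0)
        std::perror("nanosleep");
}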
Well, you are trying to tackle a difficult problem, and achieving exact timing is not feasible: the best you can do is use hardware interrupts, and the implementation will depend on both your underlying hardware and your operating system (namely, you will need a real-time operating system, which most regular desktop OSes are not). What is your exact target platform?
No. Because you're always depending on the OS to handle waking up threads at the right time.
There is no way to sleep for a specified time period using standard C. You will need, at minimum, a 3rd party library which provides greater granularity, and you might also need a special operating system kernel such as the real-time Linux kernels.
For instance, here is a discussion of how close you can come on Win32 systems.
This is not a C question.

find c++ execution time

I am curious whether there is a built-in function in C++ for measuring execution time.
I am using Windows at the moment. In Linux it's pretty easy...
The best way on Windows, as far as I know, is to use QueryPerformanceCounter and QueryPerformanceFrequency.
QueryPerformanceCounter(LARGE_INTEGER*) places the performance counter's value into the LARGE_INTEGER passed.
QueryPerformanceFrequency(LARGE_INTEGER*) places the frequency the performance counter is incremented into the LARGE_INTEGER passed.
You can then find the execution time by recording the counter as execution starts, and then recording the counter when execution finishes. Subtract the start from the end to get the counter's change, then divide by the frequency to get the time in seconds.
#include <windows.h>
#include <iostream>

LARGE_INTEGER start, finish, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
// Do something
QueryPerformanceCounter(&finish);
std::cout << "Execution took "
          << ((finish.QuadPart - start.QuadPart) / (double)freq.QuadPart)
          << " seconds" << std::endl;
It's pretty easy under Windows too; in fact it's the same function on both: std::clock, defined in <ctime>.
You can use the Windows API Function GetTickCount() and compare the values at start and end. Resolution is in the 16 ms ballpark. If for some reason you need more fine-grained timings, you'll need to look at QueryPerformanceCounter.
C++ has no built-in functions for high-granularity measuring code execution time, you have to resort to platform-specific code. For Windows try QueryPerformanceCounter: http://msdn.microsoft.com/en-us/library/ms644904(VS.85).aspx
The functions you should use depend on the timer resolution you need. Some of them give 10 ms resolution; those are easier to use. Others require more work, but give much higher resolution (and might cause you some headaches in some environments; your dev machine might work fine, though).
http://www.geisswerks.com/ryan/FAQS/timing.html
This article mentions:
timeGetTime
RDTSC (a processor feature, not an OS feature)
QueryPerformanceCounter
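As a quick sketch of the first of those (my own example; timeGetTime is millisecond-granularity, and timeBeginPeriod(1) tightens it from the coarse default on older Windows):
#include <windows.h>
#include <iostream>
#pragma comment(lib, "winmm.lib")

int main()
{
    timeBeginPeriod(1);
    DWORD t0 = timeGetTime();
    // ... code under test ...
    DWORD t1 = timeGetTime();
    timeEndPeriod(1);

    std::cout << "elapsed: " << (t1 - t0) << " ms\n";
}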
C++ works on many platforms. Why not use something that also works on many platforms, such as the Boost libraries.
Look at the documentation for the Boost Timer Library
I believe that it is a header-only library, which means that it is simple to set up and use...