Timers to measure latency - C++

When measuring network latency (time ack received - time msg sent) in any protocol over TCP, what timer would you recommend to use and why? What resolution does it have? What are other advantages/disadvantages?
Optional: how does it work?
Optional: what timer would you NOT use and why?
I'm looking mostly for Windows / C++ solutions, but if you'd like to comment on other systems, feel free to do so.
(Currently we use GetTickCount(), but it's not a very accurate timer.)

This is a copy of my answer from: C++ Timer function to provide time in nano seconds
For Linux (and BSD) you want to use clock_gettime().
#include <time.h>   // clock_gettime() is declared in <time.h>; link with -lrt on older glibc
#include <cstdio>

int main()
{
    timespec ts;
    // clock_gettime(CLOCK_MONOTONIC, &ts); // Works on FreeBSD (and Linux); immune to system clock changes
    clock_gettime(CLOCK_REALTIME, &ts);     // Works on Linux
    std::printf("%ld s, %ld ns\n", (long)ts.tv_sec, (long)ts.tv_nsec);
}
For Windows you want to use QueryPerformanceCounter. And here is more on QPC.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).

You mentioned that you use GetTickCount(), so I'm going to recommend that you take a look at QueryPerformanceCounter().
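For instance, measuring a send-to-ack interval with QueryPerformanceCounter() might look like the minimal sketch below (my own illustration, not from the original answer; the message send/ack calls are placeholders):
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, sent, acked;
    QueryPerformanceFrequency(&freq);        // counts per second, fixed at boot

    QueryPerformanceCounter(&sent);
    // ... send the message and wait for the ack here (placeholder) ...
    QueryPerformanceCounter(&acked);

    double latency_us = (acked.QuadPart - sent.QuadPart) * 1e6 / freq.QuadPart;
    std::printf("latency: %.3f us\n", latency_us);
}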

There is really no substitute for the rdtsc instruction. You cannot be sure what resolution QueryPerformanceCounter will support. Some systems have a very large granularity (a low increment rate/frequency), and some return nothing at all.
Instead, I recommend you use the rdtsc instruction. It does not require any OS support and returns the number of CPU internal clock cycles that have elapsed since the computer/processor/core was powered up. For a 3 GHz processor that's 3 billion increments per second - it doesn't get more precise than that, now does it? This instruction is available for x86-32 and -64 beginning with the Pentium or Pentium MMX. It should therefore be accessible from x86 Linuxes as well.
There are plenty of posts about it here on stackoverflow.com. I've written a few myself ...
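For illustration, a minimal sketch using the __rdtsc() intrinsic (assuming MSVC's <intrin.h>; GCC/Clang expose the same intrinsic via <x86intrin.h>):
#include <intrin.h>     // MSVC; use <x86intrin.h> with GCC/Clang
#include <cstdio>

int main()
{
    unsigned long long start = __rdtsc();
    // ... code to be timed (placeholder) ...
    unsigned long long stop = __rdtsc();

    // Elapsed CPU clock cycles; divide by the TSC frequency to convert to seconds.
    std::printf("%llu cycles\n", stop - start);
}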

Related

Similar functionality to CLOCK_MONOTONIC in Windows (and Windows Embedded CE)

I have some C++ Windows code which needs to compute time intervals. For that, it uses GetCurrentFT if it detects that it is running on WinCE, and GetSystemTimeAsFileTime on other Windows platforms. However, if I'm not mistaken, this might be vulnerable to system clock manipulations (i.e. someone changing the system clock would make the measured time intervals unreliable).
Is there something similar to UNIX's CLOCK_MONOTONIC for these platforms (both WinCE and the other Windows platforms) which would make use of a monotonically increasing counter and not the system clock?
std::chrono::steady_clock is monotonic and not vulnerable to system time changes. I don't know if Microsoft supports C++11 for WinCE though.
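A minimal sketch of interval measurement with steady_clock (assuming any C++11 compiler):
#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();
    // ... work to be timed (placeholder) ...
    auto stop = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::cout << "elapsed: " << us << " us\n";
}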
I did end up using GetTickCount(), which worked just fine.
As per Microsoft's documentation:
The number of milliseconds that have elapsed since the system was started indicates success.
It is a monotonic counter, which is what I was looking for. The granularity of the returned value depends on the hardware, but seems to be enough for my purposes in the hardware I'm using.

Is QueryPerformanceFrequency accurate when using HPET?

I'm playing around with QueryPerformanceFrequency.
It used to return 3.6 MHz, but that was not enough for what I was trying to do.
I've enabled HPET using the command bcdedit /set useplatformclock true. Now it returns 14.3 MHz. Great, it's more precise... except it's not. I quickly realized that I did not get the granularity I expected.
If I try to poll QueryPerformanceCounter until it ticks, the smallest increment I can get is 11, which means 1.27 MHz. If I try to count the number of different values I can get from QueryPerformanceCounter in one second, I get 1.26 MHz.
So I was wondering whether there is a way to really use the 14.3 MHz to its full extent?
I'm using Windows 7 (64-bit) and Visual Studio 2008.
Using the HPET hardware as a source for QueryPerformanceCounter (QPC) is known to be associated with large overheads.
QPC is an expensive call when configured with HPET.
It provides 14.3 MHz, which suggests high accuracy, but as you found, it can't be called fast enough to actually resolve that frequency.
Therefore Microsoft has turned to the CPU's time stamp counter (TSC) as a source for QPC whenever the hardware is capable of doing so. TSC queries have much lower overhead. The associated frequency used for QPC is typically the CPU frequency divided by 1024; also typically a few MHz.
The call of QPC in TSC mode is so fast that a lot of consecutive calls may show the same result (typically approx. 20-30 calls or 15-20 ns/call).
This way you may obtain typical resolutions of approx. 0.3 us (on a 3.4 GHz CPU).
You observed 3.6 MHz before you switched to HPET. That's likely the signature of the system's ACPI PM timer (3579545 Hz), which indicates that you were not operating on TSC-based QPC before switching to HPET.
So either way, running the HPET or the ACPI PM timer results in a usable resolution in the range of a few MHz. Neither can expose the full resolution given by the performance counter frequency (PCF) because the call to QPC is too expensive. Only TSC-based QPC is fast enough to actually oversample its own counter.
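If you want to reproduce these numbers yourself, a rough sketch (my own, not from the answer) is to spin on QueryPerformanceCounter() and record both the smallest non-zero increment and the average cost per call:
#include <windows.h>
#include <climits>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, first, prev, cur;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&first);

    prev = first;
    LONGLONG smallest = LLONG_MAX;
    const int kCalls = 1000000;
    for (int i = 0; i < kCalls; ++i)
    {
        QueryPerformanceCounter(&cur);
        LONGLONG d = cur.QuadPart - prev.QuadPart;
        if (d > 0 && d < smallest) smallest = d;   // smallest observed tick step
        prev = cur;
    }

    double ns_per_call = (cur.QuadPart - first.QuadPart) * 1e9 / freq.QuadPart / kCalls;
    std::printf("counter frequency : %lld Hz\n", (long long)freq.QuadPart);
    std::printf("smallest increment: %lld ticks\n", (long long)smallest);
    std::printf("cost per QPC call : %.1f ns\n", ns_per_call);
}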
Microsoft has just recently released more detailed information about this matter:
See Acquiring high-resolution time stamps (MSDN 2014) for the details.
This is a comprehensive article with lots of examples and detailed description. A must read for users of QPC.
...a way to really use the 14.3 MHz to its full extent?
Unfortunately not.
You can run the Coreinfo.exe utility from Windows Sysinternals. Sysinternals has moved to Microsoft TechNet; here is the link: Sysinternals System Information Utilities. This will give you an answer to the question: how can I check whether my system has a non-invariant TSC?
Summary: The best resolution/accuracy/granularity is obtained by QPC based on the TSC.
BTW: the choice of hardware used as the resource for QPC also influences the call expense of the new GetSystemTimePreciseAsFileTime function (Windows 8 desktop and upwards) because it internally uses QPC.

very interesting result from latency test of IPC on Linux 2.6.18

I am doing a performance (latency) test on a unix socket under Linux 2.6.18:
process A writes 1024 bytes to process B every 10 ms, and the result shows an average latency of 20 us with a small standard deviation (2~3 us).
The test becomes interesting when I run some additional CPU-bound processes simultaneously with processes A and B. These new processes are very cache-friendly, e.g. a busy loop of simple math, but to my surprise the IPC latency suddenly drops to 15 us on average.
As far as I know, to improve interactivity the O(1) scheduler (2.6 prior to 2.6.23) rewards IO-bound processes by some heuristic method, but this can't explain why the latency becomes even lower than in the first case.
I have also considered whether Linux special-cases busy loops when rewarding process A, but further testing suggests it does not.
This really confuses me.
My configuration:
CPU: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz with 10M L3 cache
MEM: 32G
OS: Linux 2.6.18-308.el5 SMP x86_64
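For reference, a stripped-down sketch of this kind of test (my reconstruction, not the poster's actual code): process A embeds a CLOCK_MONOTONIC timestamp in each 1024-byte message sent over a socketpair, and process B computes the latency on receipt.
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

static long long now_ns()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main()
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);   // DGRAM keeps message boundaries

    if (fork() == 0)                           // process B: receive and measure
    {
        char buf[1024];
        for (int i = 0; i < 100; ++i)
        {
            read(sv[1], buf, sizeof buf);
            long long sent;
            std::memcpy(&sent, buf, sizeof sent);   // timestamp embedded by A
            std::printf("latency: %lld us\n", (now_ns() - sent) / 1000);
        }
        _exit(0);
    }

    char buf[1024] = {};                       // process A: send 1024 bytes every 10 ms
    for (int i = 0; i < 100; ++i)
    {
        long long t = now_ns();
        std::memcpy(buf, &t, sizeof t);
        write(sv[0], buf, sizeof buf);
        usleep(10000);
    }
    return 0;
}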
I suspect that some power-saving feature of the hardware is at work here. A 10ms sleep is more than enough time for modern hardware to enter a low-power state. When you're looking at things at the microsecond level, there is a real, measurable latency to come out of a power-saving state.
My guess is that running the "busy" program in parallel prevents the hardware from entering a low power state. Standard things to try:
At the BIOS level, disable any and all power-saving features including C-states
At the OS level, disable cpuspeed (or whatever frequency scaling program your distro uses)
Try booting with the "idle=poll" kernel parameter
That last suggestion is especially important for Sandy Bridge CPUs (which is what you have), at least with RHEL/CentOS 5.x (which I'm guessing you're running). I found the Linux kernel would still override some BIOS settings. Other Linux kernel params that may help you:
intel_idle.max_state=0
processor.max_cstate=0

Finding out the CPU clock frequency (per core, per processor)

Programs like CPU-Z are very good at giving in-depth information about the system (bus speed, memory timings, etc.).
However, is there a programmatic way of calculating the per-core (and per-processor, in multi-processor systems with multiple cores per CPU) frequency without having to deal with CPU-specific info?
I am trying to develop an anti-cheating tool (for use with clock-limited benchmark competitions) which will be able to record the CPU clock during the benchmark run for all the active cores in the system (across all processors).
I'll expand on my comments here. This is too big and in-depth for me to fit in the comments.
What you're trying to do is very difficult - to the point of being impractical for the following reasons:
There's no portable way to get the processor frequency. rdtsc does NOT always give the correct frequency due to effects such as SpeedStep and Turbo Boost.
All known methods to measure frequency require an accurate measurement of time. However, a determined cheater can tamper with all the clocks and timers in the system.
Accurately reading either the processor frequency as well as time in a tamper-proof way will require kernel-level access. This implies driver signing for Windows.
There's no portable way to get the processor frequency:
The "easy" way to get the CPU frequency is to call rdtsc twice with a fixed time-duration in between. Then dividing out the difference will give you the frequency.
The problem is that rdtsc does not give the true frequency of the processor. Because real-time applications such as games rely on it, rdtsc needs to be consistent through CPU throttling and Turbo Boost. So once your system boots, rdtsc will always run at the same rate (unless you start messing with the bus speeds with SetFSB or something).
For example, on my Core i7 2600K, rdtsc will always show the frequency at 3.4 GHz. But in reality, it idles at 1.6 GHz and clocks up to 4.6 GHz under load via the overclocked Turbo Boost multiplier at 46x.
But once you find a way to measure the true frequency, (or you're happy enough with rdtsc), you can easily get the frequency of each core using thread-affinities.
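A sketch of that "call rdtsc twice with a fixed delay" approach on Windows, pinned to each core in turn with SetThreadAffinityMask (my own illustration; remember that this reports the TSC rate, which on modern CPUs is usually the nominal frequency, not the current one):
#include <windows.h>
#include <intrin.h>
#include <cstdio>

// Returns the TSC rate of the given core in Hz.
double TscRateOfCore(DWORD core)
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    unsigned long long c0 = __rdtsc();
    Sleep(100);                               // fixed wall-clock delay
    unsigned long long c1 = __rdtsc();
    QueryPerformanceCounter(&t1);

    double seconds = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    return double(c1 - c0) / seconds;
}

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    for (DWORD core = 0; core < si.dwNumberOfProcessors; ++core)
        std::printf("core %lu: %.0f MHz (TSC)\n", core, TscRateOfCore(core) / 1e6);
}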
Getting the True Frequency:
To get the true frequency of the processor, you need to access either the MSRs (model-specific registers) or the hardware performance counters.
These are kernel-level instructions and therefore require the use of a driver. If you're attempting this in Windows for the purpose of distribution, you will therefore need to go through the proper driver signing protocol. Furthermore, the code will differ by processor make and model so you will need different detection code for each processor generation.
Once you get to this stage, there are a variety of ways to read the frequency.
On Intel processors, the hardware counters let you count raw CPU cycles. Combined with a method of precisely measuring real time (next section), you can compute the true frequency. The MSRs give you access to other information such as the CPU frequency multiplier.
All known methods to measure frequency require an accurate measurement of time:
This is perhaps the bigger problem. You need a timer to be able to measure the frequency. A capable hacker will be able to tamper with all the clocks that you can use in C/C++.
This includes all of the following:
clock()
gettimeofday()
QueryPerformanceCounter()
etc...
The list goes on and on. In other words, you cannot trust any of the timers as a capable hacker will be able to spoof all of them. For example clock() and gettimeofday() can be fooled by changing the system clock directly within the OS. Fooling QueryPerformanceCounter() is harder.
Getting a True Measurement of Time:
All the clocks listed above are vulnerable because they are often derived from the same system base clock in some way or another. And that system base clock can be changed after the system has already booted up by means of overclocking utilities.
So the only way to get a reliable and tamper-proof measurement of time is to read external clocks such as the HPET or the ACPI. Unfortunately, these also seem to require kernel-level access.
To Summarize:
Building any sort of tamper-proof benchmark will almost certainly require writing a kernel-mode driver which requires certificate signing for Windows. This is often too much of a burden for casual benchmark writers.
This has resulted in a shortage of tamper-proof benchmarks which has probably contributed to the overall decline of the competitive overclocking community in recent years.
I realise this has already been answered. I also realise this is basically a black art, so please take it or leave it - or offer feedback.
In a quest to find the clock rate on throttled (thanks Microsoft, HP, and Dell) Hyper-V hosts (unreliable perf counter) and Hyper-V guests (you can only get the stock CPU speed, not the current one), I have managed, through trial, error, and fluke, to create a loop that loops exactly once per clock.
Code as follows - C# 5.0, SharpDev, 32-bit, target 3.5, Optimize on (crucial), no debugger active (crucial):
long frequency, start, stop;
double multiplier = 1000 * 1000 * 1000;   // nano
if (Win32.QueryPerformanceFrequency(out frequency) == false)
    throw new Win32Exception();
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(1);
const int gigahertz = 1000 * 1000 * 1000;
const int known_instructions_per_loop = 1;
int iterations = int.MaxValue;
int g = 0;
Win32.QueryPerformanceCounter(out start);
for (int i = 0; i < iterations; i++)
{
    g++;
    g++;
    g++;
    g++;
}
Win32.QueryPerformanceCounter(out stop);
// Normal ticks differ from the WMI data, i.e. 3125 when WMI reports 3201 and CPU-Z 3199.
var normal_ticks_per_second = frequency * 1000;
var ticks = (double)(stop - start);
var time = (ticks * multiplier) / frequency;
var loops_per_sec = iterations / (time / multiplier);
var instructions_per_loop = normal_ticks_per_second / loops_per_sec;
var ratio = (instructions_per_loop / known_instructions_per_loop);
var actual_freq = normal_ticks_per_second / ratio;
Console.WriteLine(String.Format("Perf counter freq: {0:n}", normal_ticks_per_second));
Console.WriteLine(String.Format("Loops per sec: {0:n}", loops_per_sec));
Console.WriteLine(String.Format("Perf counter freq div loops per sec: {0:n}", instructions_per_loop));
Console.WriteLine(String.Format("Presumed freq: {0:n}", actual_freq));
Console.WriteLine(String.Format("ratio: {0:n}", ratio));
Notes
25 instructions per loop if debugger is active
Consider running a 2- or 3-second loop beforehand to spin up the processor (or at least attempt to spin it up, knowing how heavily servers are throttled these days)
Tested on a 64bit Core2 and Haswell Pentium and compared against CPU-Z
One of the simplest ways to do it is using RDTSC, but seeing as this is for anti-cheating mechanisms, I'd put this in a kernel driver or a hypervisor-resident piece of code.
You'd probably also need to roll your own timing code**, which again can be done with RDTSC (QPC as used in the example below uses RDTSC, and it's in fact very simple to reverse engineer and use a local copy of, which means that to tamper with it, you'd need to tamper with your driver).
// Member function of a CPUInfo class (hence the use of 'this'); requires <windows.h> and <intrin.h>.
void CPUInfo::GetProcessorSpeed()
{
    CPUInfo* pInfo = this;
    LARGE_INTEGER qwWait, qwStart, qwCurrent;
    QueryPerformanceCounter(&qwStart);
    QueryPerformanceFrequency(&qwWait);
    qwWait.QuadPart >>= 5;                      // wait for 1/32 of a second
    unsigned __int64 Start = __rdtsc();
    do
    {
        QueryPerformanceCounter(&qwCurrent);
    } while (qwCurrent.QuadPart - qwStart.QuadPart < qwWait.QuadPart);
    // Cycles elapsed in 1/32 s, scaled back up to one second and converted to MHz.
    pInfo->dCPUSpeedMHz = ((__rdtsc() - Start) << 5) / 1000000.0;
}
** I think this would be for security, as @Mystical mentioned, but as I've never felt the urge to subvert low-level system timing mechanisms, there might be more involved; it would be nice if Mystical could add something on that :)
I've previously posted on this subject (along with a basic algorithm): here. To my knowledge the algorithm (see the discussion) is very accurate. For example, Windows 7 reports my CPU clock as 2.00 GHz, CPU-Z as 1994-1996 MHz and my algorithm as 1995025-1995075 kHz.
The algorithm performs a lot of loops to do this which causes the CPU frequency to increase to maximum (as it also will during benchmarks) so speed-throttling software won't come into play.
Additional info here and here.
On the question of speed throttling, I really don't see it as a problem unless an application uses the speed values to determine elapsed times and the times themselves are extremely important. For example, if a division requires x clock cycles to complete, it doesn't matter if the CPU is running at 3 GHz or 300 MHz: it will still need x clock cycles, and the only difference is that it will complete the division in a tenth of the time at 3 GHz.
You need to use CallNtPowerInformation. Here's a code sample from the putil project.
With this you can get current and max CPU frequency. As far as I know it's not possible to get per-CPU frequency.
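As a rough sketch of how that API is typically called (my own example, not the putil code; it assumes the documented PROCESSOR_POWER_INFORMATION layout and links against PowrProf.lib; it returns one record per logical processor, though as noted above the reported frequency may not be truly per-core):
#include <windows.h>
#include <winternl.h>    // NTSTATUS, used by the powrprof.h prototype on some SDKs
#include <powrprof.h>    // CallNtPowerInformation
#include <cstdio>
#include <vector>

#pragma comment(lib, "PowrProf.lib")

// Documented layout of the records returned by the ProcessorInformation level.
typedef struct _PROCESSOR_POWER_INFORMATION {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION;

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    std::vector<PROCESSOR_POWER_INFORMATION> info(si.dwNumberOfProcessors);
    if (CallNtPowerInformation(ProcessorInformation, NULL, 0,
                               info.data(),
                               (ULONG)(info.size() * sizeof(info[0]))) == 0)
    {
        for (const auto& p : info)
            std::printf("CPU %lu: current %lu MHz, max %lu MHz\n",
                        p.Number, p.CurrentMhz, p.MaxMhz);
    }
}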
One should refer to this white paper: Intel® Turbo Boost Technology in Intel® Core™ Microarchitecture (Nehalem) Based Processors. Basically, produce several reads of the UCC fixed performance counter over a sample period T.
Relative.Freq = Delta(UCC) / T
where Delta(UCC) = (UCC @ period T) - (UCC @ period T-1)
Starting with the Nehalem architecture, UCC counts clock ticks only while the core is in the unhalted state, so its rate rises and falls with the core's actual clock.
When SpeedStep or Turbo Boost is active, the frequency estimated from UCC changes accordingly, while the TSC rate remains constant. For instance, with Turbo Boost in action, Delta(UCC) is greater than or equal to Delta(TSC).
See an example in the Core_Cycle function of the CoreFreq project on GitHub (Cyring | CoreFreq).

Sub-millisecond precision timing in C or C++

What techniques / methods exist for getting sub-millisecond precision timing data in C or C++, and what precision and accuracy do they provide? I'm looking for methods that don't require additional hardware. The application involves waiting for approximately 50 microseconds +/- 1 microsecond while some external hardware collects data.
EDIT: The OS is Windows, probably with VS2010. If I can get drivers and SDKs for the hardware on Linux, I can go there using the latest GCC.
When dealing with off-the-shelf operating systems, accurate timing is an extremely difficult and involved task. If you really need guaranteed timing, the only real option is a full real-time operating system. However, if "almost always" is good enough, here are a few tricks you can use that will provide good accuracy under commodity Windows and Linux:
Use a shielded CPU. Basically, this means turning off IRQ affinity for a selected CPU and setting the processor affinity mask for all other processes on the machine to ignore your targeted CPU. In your app, set the CPU affinity to run only on your shielded CPU. Effectively, this should prevent the OS from ever suspending your app, as it will always be the only runnable process for that CPU.
Never let your process willingly yield control to the OS (which is inherently non-deterministic for non-realtime OSes). No memory allocation, no sockets, no mutexes, nada. Use RDTSC to spin in a while loop waiting for your target time to arrive. It'll consume 100% of a CPU, but it's the most accurate way to go.
If number 2 is a bit too draconian, you can "sleep short" and then burn the CPU up to your target time. Here you take advantage of the fact that the OS schedules the CPU at set intervals, usually 100 or 1000 times per second depending on your OS and configuration (on Windows you can change the default scheduling period from 100/s to 1000/s using the multimedia API). This can be a little hard to get right, but essentially you need to determine when the OS scheduling periods occur and calculate the one prior to your target wake time. Sleep for that duration and then, upon waking, spin on RDTSC (if you're on a single CPU... use QueryPerformanceCounter or the Linux equivalent if not) until your target time arrives. Occasionally OS scheduling will cause you to miss, but generally speaking this mechanism works pretty well. A rough sketch of this approach appears at the end of this answer.
It seems like a simple question, but attaining "good" timing gets exponentially more difficult the tighter your timing constraints are. Good luck!
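Here is a minimal sketch of option 3 (my own illustration, Windows-specific: timeBeginPeriod(1) raises the scheduler rate, Sleep() gets close to the target, and a QueryPerformanceCounter spin covers the last stretch):
#include <windows.h>
#include <cstdio>

#pragma comment(lib, "winmm.lib")   // timeBeginPeriod / timeEndPeriod

// Busy-wait until 'target_us' microseconds have elapsed since 'start'.
void SpinUntil(const LARGE_INTEGER& start, const LARGE_INTEGER& freq, double target_us)
{
    LARGE_INTEGER now;
    do {
        QueryPerformanceCounter(&now);
    } while ((now.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart < target_us);
}

int main()
{
    timeBeginPeriod(1);                       // 1 ms scheduler granularity

    LARGE_INTEGER freq, start;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    // Example: wait 5 ms. For very short waits (like the 50 us in the question)
    // skip the Sleep() entirely and just spin.
    const double target_us = 5000.0;
    Sleep((DWORD)(target_us / 1000.0) - 1);   // sleep short: leave ~1 ms to spin
    SpinUntil(start, freq, target_us);

    LARGE_INTEGER end;
    QueryPerformanceCounter(&end);
    std::printf("waited %.1f us\n", (end.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart);

    timeEndPeriod(1);
}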
The hardware (and therefore resolution) varies from machine to machine. On Windows, specifically (I'm not sure about other platforms), you can use QueryPerformanceCounter and QueryPerformanceFrequency, but be aware you should call both from the same thread and there are no strict guarantees about resolution (QueryPerformanceFrequency is allowed to return 0 meaning no high resolution timer is available). However, on most modern desktops, there should be one accurate to microseconds.
Boost.Date_Time has a microsecond-precision clock, but its accuracy depends on the platform.
The documentation states:
ptime microsec_clock::local_time()
"Get the local time using a sub second resolution clock. On Unix systems this is implemented using GetTimeOfDay. On most Win32 platforms it is implemented using ftime. Win32 systems often do not achieve microsecond resolution via this API. If higher resolution is critical to your application test your platform to see the achieved resolution."
http://www.boost.org/doc/libs/1_43_0/doc/html/date_time/posix_time.html#date_time.posix_time.ptime_class
You may try the following:
#include <sys/time.h>   // gettimeofday()

struct timeval t;
gettimeofday(&t, 0);    // t.tv_sec = seconds, t.tv_usec = microseconds
This gives you the current timestamp with microsecond resolution. I am not sure about the accuracy.
You could try the technique described here, but it's not portable.
Most modern processors have registers for timing or other instrumentation purposes. On x86, since Pentium days, there is the RDTSC instruction, for example. Your compiler may give you access to this instruction.
See wikipedia for more info.
timeval in sys/time.h has a member 'tv_usec' which is microseconds.
This link and the code below will help illustrate:
http://www.opengroup.org/onlinepubs/000095399/basedefs/sys/time.h.html
#include <sys/time.h>
#include <cstdio>
#include <iostream>
using namespace std;

int main()
{
    timeval start;
    timeval finish;
    long int sec_diff;
    long int mic_diff;

    gettimeofday(&start, 0);
    cout << "whooo hooo" << endl;
    gettimeofday(&finish, 0);
    sec_diff = finish.tv_sec - start.tv_sec;
    mic_diff = finish.tv_usec - start.tv_usec;
    cout << "cout-ing 'whooo hooo' took " << sec_diff << " seconds and " << mic_diff << " micros." << endl;

    gettimeofday(&start, 0);
    printf("whooo hooo\n");
    gettimeofday(&finish, 0);
    sec_diff = finish.tv_sec - start.tv_sec;
    mic_diff = finish.tv_usec - start.tv_usec;
    cout << "printf-ing 'whooo hooo' took " << sec_diff << " seconds and " << mic_diff << " micros." << endl;
}
Good luck trying to do that with MS Windows. You need a realtime operating system, that is to say, one where timing is guaranteed repeatable. Windows can switch to another thread or even another process at an inopportune moment. You will also have no control over cache misses.
When I was doing realtime robotic control, I used a very lightweight OS called OnTime RTOS32, which has a partial Windows API emulation layer. I do not know if it would be suitable for what you are doing. However, with Windows, you will probably never be able to prove that it will never fail to give the timely response.
A combination of GetSystemTimeAsFileTime and QueryPerformanceCounter can result in a reliable suite of code to obtain microsecond-resolution time services on Windows.
See this comment in another thread here.
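The basic idea is roughly the following (a sketch of my own, not the code from the linked comment): capture one GetSystemTimeAsFileTime reading as a wall-clock anchor, then extrapolate from it with QueryPerformanceCounter, which ticks far more often than the file time is updated.
#include <windows.h>
#include <cstdio>

// Wall-clock anchor captured once, then extended with QPC.
static FILETIME      g_anchorFt;
static LARGE_INTEGER g_anchorQpc, g_qpcFreq;

void InitPreciseClock()
{
    QueryPerformanceFrequency(&g_qpcFreq);
    GetSystemTimeAsFileTime(&g_anchorFt);
    QueryPerformanceCounter(&g_anchorQpc);
}

// Current time in 100 ns FILETIME units, with QPC providing the sub-tick part.
// Note: drifts away from the system clock over long runs; re-anchor periodically.
unsigned long long PreciseFileTimeNow()
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    unsigned long long anchor =
        ((unsigned long long)g_anchorFt.dwHighDateTime << 32) | g_anchorFt.dwLowDateTime;
    unsigned long long elapsed100ns =
        (now.QuadPart - g_anchorQpc.QuadPart) * 10000000ULL / g_qpcFreq.QuadPart;
    return anchor + elapsed100ns;
}

int main()
{
    InitPreciseClock();
    std::printf("now: %llu (100 ns units since 1601)\n", PreciseFileTimeNow());
}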