Timer differences between Win7 & Win10 - C++

I have an application where I use the MinGW implementation of gettimeofday to achieve "precise" timing (~1ms precision) on Win7. It works fine.
However, when running the same code (and even the same *.exe) on Win10, the precision drops drastically to the famous 15.6ms granularity, which is not enough for me.
Two questions:
- do you know what can be the root cause of such discrepancies? (is it an OS config/"feature"?)
- how can I fix it? Or, better, is there a precise timer agnostic to the OS config?
NB: std::chrono::high_resolution_clock seems to have the same issue (at least it does show the 15.6ms limit on Win10).

From Hans Passant's comments and additional tests on my side, here is a sounder answer:
The 15.6ms (1/64 second) limit is well known on Windows and is the default behavior. It is possible to lower the limit (e.g. to 1ms, through a call to timeBeginPeriod()), though we are advised not to do so, because this affects the global system timer resolution and the resulting power consumption. For instance, Chrome is notorious for doing this. Hence, because the timer resolution is global, one may observe a 1ms precision without explicitly asking for it, thanks to third-party programs.
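For reference, a minimal sketch of what that (discouraged) global override looks like; this is my own illustration, not code from the question, and it assumes you link against winmm.lib. Always pair the two calls so the default resolution is restored:
#include <windows.h>
#include <mmsystem.h>   // timeBeginPeriod / timeEndPeriod; link with winmm.lib

void run_with_1ms_timer()
{
    // Request a 1 ms global timer resolution (affects the whole system).
    if (timeBeginPeriod(1) == TIMERR_NOERROR)
    {
        // ... timing-sensitive work here ...
        timeEndPeriod(1);   // restore the default resolution when done
    }
}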
Besides, be aware that std::chrono::high_resolution_clock does not have a valid behavior on Windows (in both the Visual Studio and MinGW contexts). So you cannot expect this interface to be a cross-platform solution, and the 15.625ms limit still applies.
Knowing that, how can we deal with it? Well, one can use the timeBeginPeriod() trick to increase the precision of some timers but, again, we are advised not to do so: it seems better to use QueryPerformanceCounter() (QPC), which, according to Microsoft, is the primary API for native code that needs to acquire high-resolution time stamps or measure time intervals. Note that QPC counts elapsed time (not CPU cycles). Here is a usage example:
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency

LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Activity to be timed
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
//
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second. We use these values
// to convert to the number of elapsed microseconds.
// To guard against loss-of-precision, we convert
// to microseconds *before* dividing by ticks-per-second.
//
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
According to Microsoft, QPC is also suitable in a multicore/multithread context, though comparisons across threads can be ambiguous by one tick:
When you compare performance counter results that are acquired from different threads, consider values that differ by ± 1 tick to have an ambiguous ordering. If the time stamps are taken from the same thread, this ± 1 tick uncertainty doesn't apply. In this context, the term tick refers to a period of time equal to 1 ÷ (the frequency of the performance counter obtained from QueryPerformanceFrequency).
As additional resources, MS also provides an FAQ on how/why to use QPC and an explanation of clocks/timing in Windows.

Related

How can I measure sub-second durations?

In my calculator-like program, the user selects what and how much to compute (e.g. how many digits of pi, how many prime numbers, etc.). I use time(0) to check the computation time elapsed in order to trigger a timeout condition. If the computation completes without a timeout, I also print the computation time taken, which is stored in a double, the return type of difftime().
I just found out that the time values calculated are in seconds only. I don't want user inputs of 100 and 10000 to both print a computation duration of 0e0 seconds. I want them to print, for example, durations of 1.23e-6 and 4.56e-3 seconds respectively (as accurately as the machine can measure - I am more acquainted with the accuracy provided in Java and with the accuracies in scientific measurements, so it's a personal preference).
I have seen the answers to other questions, but they don't help because 1) I will not be multi-threading (not preferred in my work environment). 2) I cannot use C++11 or later.
How can I obtain time duration values more accurate than seconds as integral values given the stated constraints?
Edit: Platform & machine-independent solutions preferred, otherwise Windows will do, thanks!
Edit 2: My notebook is also not connected to the Internet, so no downloading of external libraries like Boost (is that what Boost is?). I'll have to code anything myself.
You can use QueryPerformanceCounter (QPC), which is part of the Windows API, to do high-resolution time measurements.
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency

LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Activity to be timed
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
//
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second. We use these values
// to convert to the number of elapsed microseconds.
// To guard against loss-of-precision, we convert
// to microseconds *before* dividing by ticks-per-second.
//
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
On Windows, the simplest solution is to use GetTickCount, which returns the number of milliseconds since the computer was started.
#include <windows.h>
#include <iostream>
...
DWORD before = GetTickCount();
...
DWORD duration = GetTickCount() - before;
std::cout << "It took " << duration << " ms\n";
Caveats:
it works only on Windows;
the resolution (milliseconds) is not stellar;
given that the result is a 32-bit integer, it wraps around after about 49.7 days (2^32 milliseconds); thus, you cannot measure anything longer than that; a possible solution is to use GetTickCount64, which however is available only from Vista onwards;
since systems with an uptime of more than a month are actually quite common, you may indeed have to deal with results bigger than 2^31; thus, make sure to always keep such values in a DWORD (or a uint32_t), without casting them to int, or you risk signed integer overflow. Another option is to store them in a 64-bit signed integer (or a double) and forget the difficulties of dealing with unsigned integers.
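A small self-contained illustration of that last point (using uint32_t stand-ins instead of real GetTickCount() readings): as long as the measured interval itself is shorter than the wrap period, unsigned subtraction still produces the right difference even across a wrap.
#include <cstdint>
#include <iostream>

int main()
{
    // Simulated tick values around a GetTickCount() wrap: 'before' is near the
    // top of the 32-bit range, 'after' has already wrapped past zero.
    std::uint32_t before = 0xFFFFFF00u;
    std::uint32_t after  = 0x00000200u;
    std::uint32_t duration = after - before;   // modulo-2^32 arithmetic: 0x300
    std::cout << duration << " ms\n";          // prints 768, not a huge or negative number
    return 0;
}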
I realize the compiler you're using doesn't support it, but for reference purposes the C++11 solution is simple...
auto start = std::chrono::high_resolution_clock::now();
// Activity to be timed
auto end = std::chrono::high_resolution_clock::now();
long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

Why clock() does not work on the cluster machine

I want to get the running time of part of my code.
My C++ code is like:
...
clock_t t1 = clock();
/*
    Here is my core code.
*/
clock_t t2 = clock();
cout << "Running time: " << (1000.0 * (t2 - t1)) / CLOCKS_PER_SEC << "ms" << endl;
...
This code works well on my laptop (openSUSE, g++ and clang++, Core i5).
But it does not work well on the cluster in the department (Ubuntu, g++, AMD Opteron and Intel Xeon).
I always get an integer running time, like 0ms, 10ms or 20ms.
What causes that? Why? Thanks!
Clocks are not guaranteed to be exact down to ~10^-44 seconds (Planck time); they often have a minimal resolution. The Linux man page implies this with:
The clock() function returns an approximation of processor time used by the program.
and so does the ISO standard C11 7.27.2.1 The clock function /3:
The clock function returns the implementation’s best approximation of ...
and in 7.27.1 Components of time /4:
The range and precision of times representable in clock_t and time_t are implementation-defined.
From your (admittedly limited) sample data, it looks like the minimum resolution of your cluster machines is on the order of 10ms.
In any case, you have several possibilities if you need a finer resolution.
First, find a (probably implementation-specific) means of timing things more accurately.
Second, don't do it once. Do it a thousand times in a tight loop and then just divide the time taken by 1000 (see the sketch at the end of this answer). That should roughly increase your resolution a thousand-fold.
Third, think about the implication that your code only takes 50ms at the outside. Unless you have a pressing need to execute it more than twenty times a second (assuming you have no other code to run), it may not be an issue.
On that last point, think of things like "What's the longest a user will have to wait before they get annoyed?". The answer to that would vary but half a second might be fine in most situations.
Since 50ms code could run ten times over during that time, you may want to ignore it. You'd be better off concentrating on code that has a clearly larger impact.
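To illustrate the second suggestion, here is a minimal sketch of amortising the clock() resolution over many iterations; my_core_code() and the iteration count are placeholders for your own code.
#include <ctime>
#include <iostream>

volatile double sink = 0.0;       // volatile so the work is not optimised away

void my_core_code()               // stand-in for the real core code
{
    for (int i = 0; i < 1000; ++i)
        sink += i * 0.5;
}

int main()
{
    const int iterations = 1000;

    clock_t t1 = clock();
    for (int i = 0; i < iterations; ++i)
        my_core_code();
    clock_t t2 = clock();

    double ms_total    = (1000.0 * (t2 - t1)) / CLOCKS_PER_SEC;
    double ms_per_call = ms_total / iterations;
    std::cout << "Running time: " << ms_per_call << " ms per call" << std::endl;
    return 0;
}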

C++ fine granular time

The following piece of code gives 0 as runtime of the function. Can anybody point out the error?
struct timeval start,end;
long seconds,useconds;
gettimeofday(&start, NULL);
int optimalpfs=optimal(n,ref,count);
gettimeofday(&end, NULL);
seconds = end.tv_sec - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
cout<<"\nOptimal Runtime is "<<opt_runtime<<"\n";
I get both start and end time as the same. I get the following output:
Optimal Runtime is 0
Tell me the error please.
POSIX 1003.1b-1993 specifies interfaces for clock_gettime() (and clock_getres()), and offers that with the MON option there can be a type of clock with a clockid_t value of CLOCK_MONOTONIC (so that your timer isn't affected by system time adjustments). If available on your system then these functions return a structure which has potential resolution down to one nanosecond, though the latter function will tell you exactly what resolution the clock has.
struct timespec {
        time_t  tv_sec;         /* seconds */
        long    tv_nsec;        /* and nanoseconds */
};
You may still need to run your test function in a loop many times for the clock to register any time elapsed beyond its resolution, and perhaps you'll want to run your loop enough times to last at least an order of magnitude more time than the clock's resolution.
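A minimal sketch of that approach, assuming your platform provides CLOCK_MONOTONIC (on older glibc you may also need to link with -lrt); clock_getres() reports the actual resolution, and work_under_test() is a placeholder:
#include <time.h>
#include <stdio.h>

static volatile long sink;

static void work_under_test(void)          /* placeholder for the code being timed */
{
    for (long i = 0; i < 100000; ++i)
        sink += i;
}

int main(void)
{
    struct timespec res, start, end;

    clock_getres(CLOCK_MONOTONIC, &res);   /* what the clock can actually resolve */
    printf("clock resolution: %ld ns\n", res.tv_nsec);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 1000; ++i)         /* loop so the total dwarfs the resolution */
        work_under_test();
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("%lld ns total, %lld ns per iteration\n", ns, ns / 1000);
    return 0;
}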
Note though that apparently the Linux folks mis-read the POSIX.1b specifications and/or didn't understand the definition of a monotonically increasing time clock, and their CLOCK_MONOTONIC clock is affected by system time adjustments, so you have to use their invented non-standard CLOCK_MONOTONIC_RAW clock to get a real monotonic time clock.
Alternately one could use the related POSIX.1 timer_settime() call to set a timer running, a signal handler to catch the signal delivered by the timer, and timer_getoverrun() to find out how much time elapsed between the queuing of the signal and its final delivery, and then set your loop to run until the timer goes off, counting the number of iterations in the time interval that was set, plus the overrun.
Of course on a preemptive multi-tasking system these clocks and timers will run even while your process is not running, so they are not really very useful for benchmarking.
Slightly more rare is the optional POSIX.1-1999 clockid_t value of CLOCK_PROCESS_CPUTIME_ID, indicated by the presence of the _POSIX_CPUTIME from <time.h>, which represents the CPU-time clock of the calling process, giving values representing the amount of execution time of the invoking process. (Even more rare is the TCT option of clockid_t of CLOCK_THREAD_CPUTIME_ID, indicated by the _POSIX_THREAD_CPUTIME macro, which represents the CPU time clock, giving values representing the amount of execution time of the invoking thread.)
Unfortunately POSIX makes no mention of whether these so-called CPUTIME clocks count just user time, or both user and system (and interrupt) time, accumulated by the process or thread, so if your code under profiling makes any system calls then the amount of time spent in kernel mode may, or may not, be represented.
Even worse, on multi-processor systems, the values of the CPUTIME clocks may be completely bogus if your process happens to migrate from one CPU to another during its execution. The timers implementing these CPUTIME clocks may also run at different speeds on different CPU cores, and at different times, further complicating what they mean. I.e. they may not mean anything related to real wall-clock time, but only be an indication of the number of CPU cycles (which may still be useful for benchmarking so long as relative times are always used and the user is aware that execution time may vary depending on external factors). Even worse it has been reported that on Linux CPU TimeStampCounter-based CPUTIME clocks may even report the time that a process has slept.
If your system has a good working getrusage() system call then it will hopefully be able to give you a struct timeval for each of the actual user and system times separately consumed by your process while it was running. However, since this puts you back to a microsecond clock at best, you'll need to run your test code enough times repeatedly to get a more accurate timing, calling getrusage() once before the loop and again afterwards, and then calculating the differences between the times given. For simple algorithms this might mean running them millions of times, or more. Note also that on many systems the division between user time and system time is done somewhat arbitrarily, and if examined separately in a repeated loop, one or the other can even appear to run backwards. However, if your algorithm makes no system calls then summing the time deltas should still be a fair total time for your code execution.
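A rough sketch of that getrusage() approach (RUSAGE_SELF, user and system time reported separately; the workload loop is just a stand-in):
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    struct rusage before, after;
    volatile double x = 0.0;               /* stand-in workload, kept out of the optimiser */

    getrusage(RUSAGE_SELF, &before);
    for (long i = 0; i < 50000000L; ++i)
        x += i * 0.5;
    getrusage(RUSAGE_SELF, &after);

    long long user_us = (after.ru_utime.tv_sec  - before.ru_utime.tv_sec)  * 1000000LL
                      + (after.ru_utime.tv_usec - before.ru_utime.tv_usec);
    long long sys_us  = (after.ru_stime.tv_sec  - before.ru_stime.tv_sec)  * 1000000LL
                      + (after.ru_stime.tv_usec - before.ru_stime.tv_usec);
    printf("user: %lld us, system: %lld us\n", user_us, sys_us);
    return 0;
}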
BTW, take care when comparing time values so that you don't overflow or end up with a negative value in a field, either as @Nim suggests, or perhaps like this (from NetBSD's <sys/time.h>):
#define timersub(tvp, uvp, vvp)                                         \
        do {                                                            \
                (vvp)->tv_sec = (tvp)->tv_sec - (uvp)->tv_sec;          \
                (vvp)->tv_usec = (tvp)->tv_usec - (uvp)->tv_usec;       \
                if ((vvp)->tv_usec < 0) {                               \
                        (vvp)->tv_sec--;                                \
                        (vvp)->tv_usec += 1000000;                      \
                }                                                       \
        } while (0)
(you might even want to be more paranoid that tv_usec is in range)
One more important note about benchmarking: make sure your function is actually being called, ideally by examining the assembly output from your compiler. Compiling your function in a separate source module from the driver loop usually convinces the optimizer to keep the call. Another trick is to have it return a value that you assign inside the loop to a variable defined as volatile.
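A sketch of that last trick; compute_something() is a hypothetical stand-in for the function under test, and in practice it would live in a separate source file so the optimizer cannot see through it (it is defined here only so the example compiles on its own):
static int compute_something(int x) { return x * x + 1; }   // stand-in for the real function

volatile int sink;                    // stores to a volatile cannot be optimised away

void benchmark_loop(int iterations)
{
    for (int i = 0; i < iterations; ++i)
        sink = compute_something(i);  // the result is "used", so the call is kept
}

int main()
{
    benchmark_loop(1000000);
    return 0;
}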
You've got a weird mix of floats and ints here:
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
Try using:
long opt_runtime = (long)(seconds * 1000 + (float)useconds/1000);
This way you'll get your results in milliseconds.
The execution time of optimal(...) is less than the granularity of gettimeofday(...). This likely happens on Windows, where the typical granularity is up to 20 ms. I've answered a related gettimeofday(...) question here.
For Linux I asked How is the microsecond time of linux gettimeofday() obtained and what is its accuracy? and got a good result.
More information on how to obtain accurate timing is described in this SO answer.
I normally do such a calculation as:
long long ss = start.tv_sec * 1000000LL + start.tv_usec;
long long es = end.tv_sec * 1000000LL + end.tv_usec;
Then do a difference
long long microsec_diff = es - ss;
Now convert as required:
double seconds = microsec_diff / 1000000.;
Normally I don't bother with the last step; I do all timings in microseconds.

Timing and Benchmarking

If I want to compare the speed of two algorithms, what's the best way?
I am familiar with the mathematical way, though I know I don't fully master it yet =/
I'm interested in knowing how to time two algorithms with good precision on Windows.
What are the best APIs to use? Is time() from <ctime> reliable, or do I need something better?
An example would be good too!
Thank you!
Unfortunately this is not such an easy task.
As previously mentioned QueryPerformanceCounter is a viable choice.
Other possibilities are:
- GetTickCount (not advisable, since its accuracy is worse than 30ms)
- timeGetTime: by default it has 15ms accuracy. This corresponds to the default time slice allocated by the task scheduler (15ms on my computer). You can force timeGetTime to be more accurate by changing a system-wide setting that governs the time allocated by the scheduler: a call to timeBeginPeriod can do this. However, this should only be used as a temporary system hack! Please do not use it in your release code!
- Query the processor's time stamp counter: this requires assembler programming and I would not recommend it.
As far as QueryPerformanceCounter is concerned, you can find an easy-to-use wrapper here: http://www.codeproject.com/KB/datetime/perftimer.aspx
You can use it this way:
CPerfTimer t;
t.Start();
CallExpensiveTask();
std::cout << "Time (ms) " << t.Elapsedms();
A few pieces of advice, however:
As mentioned, run your function many times in order to get a reliable measurement
Beware of the task scheduler:
On an average computer, each process is given 15ms before switching to another process. If the task scheduler switches tasks during the call to your measured function, you might measure a time that is much higher (around 15ms higher)
Note that Sleep(1) yields a ~15ms pause (since the scheduler will switch to a different process)
Remember that QueryPerformanceCounter can (rarely) give inaccurate results:
On multicore processors, you might sometimes see a negative time lapse (!). In that case, you should redo your measurement (see http://www.virtualdub.org/blog/pivot/entry.php?id=106)
Temporary solutions to counteract the scheduler effect:
- Raise your process priority (you can do it through the task manager, or programmatically; see the sketch after this list)
- Hack the scheduler: timeBeginPeriod (http://msdn.microsoft.com/en-us/library/dd757624(v=VS.85).aspx), provided by Microsoft, has the power to change the time allocated to each task from 15ms to lower values (remember not to include this in your release code, since this is a system-wide setting that can reduce the global performance...)
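A sketch of doing the first of these programmatically instead of through the task manager (plain Windows API calls; the function name is just a placeholder, and remember to restore the defaults afterwards so the rest of the system is not starved):
#include <windows.h>

void run_measurement_at_high_priority()
{
    // Temporarily boost the process and the measuring thread.
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);

    // ... run the timed code here ...

    // Restore the defaults.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);
    SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);
}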
Some more notes about the processor's time stamp counter
I did not test its accuracy myself, but http://en.wikipedia.org/wiki/Time_Stamp_Counter is a good source of information about it, and about its limitations (esp. on multicore processors, and also on processors with variable clock frequencies)
AFAIK, it is advisable to use this kind of timer on a single thread for which you set the processor affinity
An example implementation (found at http://developer.nvidia.com/object/timer_function_performance.html) could be:
#pragma warning (push)
#pragma warning (disable : 4035) // disable no return value warning
__forceinline DWORD GetPentiumCounter()
{
    __asm
    {
        xor eax,eax   // VC won't realize that eax is modified w/out this
                      // instruction to modify the val.
                      // Problem shows up in release mode builds
        _emit 0x0F    // Pentium high-freq counter to edx;eax
        _emit 0x31    // only care about low 32 bits in eax
        xor edx,edx   // so VC gets that edx is modified
    }
}
#pragma warning (pop)
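Note that MSVC does not accept __asm blocks when targeting x64; on current compilers you can read the same counter through the __rdtsc() intrinsic instead (<intrin.h> on MSVC, <x86intrin.h> on GCC/Clang), for example:
#if defined(_MSC_VER)
#include <intrin.h>        // __rdtsc on MSVC
#else
#include <x86intrin.h>     // __rdtsc on GCC/Clang
#endif

unsigned long long cycles_elapsed_demo()
{
    unsigned long long c0 = __rdtsc();
    // ... code to measure ...
    unsigned long long c1 = __rdtsc();
    return c1 - c0;        // elapsed TSC ticks (see the caveats above)
}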
What you are looking for is QueryPerformanceCounter and QueryPerformanceFrequency. These are Windows API functions to access high-resolution timers. They should yield better precision than time() or GetSystemTime().
LARGE_INTEGER time1;
QueryPerformanceCounter(&time1);
// Your code
LARGE_INTEGER time2;
QueryPerformanceCounter(&time2);
LARGE_INTEGER ticksPerSecond;
QueryPerformanceFrequency(&ticksPerSecond);
double seconds = (double)(time2.QuadPart - time1.QuadPart) / ticksPerSecond.QuadPart;
And here I found an implementation of a CStopWatch class that could be used as an example.
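In case that link goes stale, a minimal stopwatch-style wrapper around the two calls might look like this (a sketch of the idea, not the linked CStopWatch itself):
#include <windows.h>

class StopWatch
{
public:
    StopWatch()  { QueryPerformanceFrequency(&freq_); Start(); }
    void Start() { QueryPerformanceCounter(&start_); }
    double ElapsedSeconds() const
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        return (double)(now.QuadPart - start_.QuadPart) / freq_.QuadPart;
    }
private:
    LARGE_INTEGER freq_;
    LARGE_INTEGER start_;
};
Usage: construct it (or call Start()) just before the code under test and read ElapsedSeconds() afterwards.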
QueryPerformanceCounter/QueryPerformanceFrequency. AFAIK, they are fixed now and actually measure cycles. Just remember to raise the priority of the thread so that it is less affected by other threads while you measure.
Preemption can cause quite a bit of error if the runtime of the measured code is short.
So raise the priority and rerun the tests multiple times.

Measuring execution time of a call to system() in C++

I have found some code on measuring execution time here
http://www.dreamincode.net/forums/index.php?showtopic=24685
However, it does not seem to work for calls to system(). I imagine this is because the execution jumps out of the current process.
clock_t begin=clock();
system(something);
clock_t end=clock();
cout<<"Execution time: "<<diffclock(end,begin)<<" s."<<endl;
Then
double diffclock(clock_t clock1, clock_t clock2)
{
    double diffticks = clock1 - clock2;
    double diffms = diffticks / CLOCKS_PER_SEC;
    return diffms;
}
However this always returns 0 seconds... Is there another method that will work?
Also, this is in Linux.
Edit: Also, just to add, the execution time is in the order of hours. So accuracy is not really an issue.
Thanks!
Have you considered using gettimeofday?
struct timeval tv;
struct timeval start_tv;
gettimeofday(&start_tv, NULL);
system(something);
double elapsed = 0.0;
gettimeofday(&tv, NULL);
elapsed = (tv.tv_sec - start_tv.tv_sec) +
(tv.tv_usec - start_tv.tv_usec) / 1000000.0;
Unfortunately clock() only has one second resolution on Linux (even though it returns the time in units of microseconds).
Many people use gettimeofday() for benchmarking, but that measures elapsed time - not time used by this process/thread - so it isn't ideal. Obviously if your system is more or less idle and your tests are quite long then you can average the results. Normally less of a problem, but still worth knowing about, is that the time returned by gettimeofday() is non-monotonic - it can jump around a bit, e.g. when your system first connects to an NTP time server.
The best thing to use for benchmarking is clock_gettime() with whichever option is most suitable for your task.
CLOCK_THREAD_CPUTIME_ID - Thread-specific CPU-time clock.
CLOCK_PROCESS_CPUTIME_ID - High-resolution per-process timer from the CPU.
CLOCK_MONOTONIC - Represents monotonic time since some unspecified starting point.
CLOCK_REALTIME - System-wide realtime clock.
NOTE though, that not all options are supported on all Linux platforms - except clock_gettime(CLOCK_REALTIME) which is equivalent to gettimeofday().
Useful link: Profiling Code Using clock_gettime
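Applied to the original system() case, a sketch with CLOCK_MONOTONIC (a wall-clock style measurement, so it does include the time spent in the child command; a CPU-time clock such as CLOCK_PROCESS_CPUTIME_ID would not, because the parent mostly sleeps while system() waits). The command string is only an example:
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    system("sleep 2");                       /* example command */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("system() took %.3f s\n", elapsed);
    return 0;
}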
Tuomas Pelkonen already presented the gettimeofday method that allows you to get times with a resolution of one microsecond.
In his example he goes on to convert to double. I personally have wrapped the timeval struct into a class of my own that keeps the counts in seconds and microseconds as integers and handles the add and subtract operations correctly.
I prefer to keep integers (with exact maths) rather than move to floating-point numbers and all their woes when I can.
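A sketch of that kind of wrapper (the names are mine, and the borrow logic mirrors the timersub() macro quoted earlier on this page):
#include <sys/time.h>

class TimeVal
{
public:
    static TimeVal Now()
    {
        TimeVal t;
        gettimeofday(&t.tv_, 0);
        return t;
    }

    // Difference kept as exact integer seconds/microseconds.
    TimeVal operator-(const TimeVal& rhs) const
    {
        TimeVal d;
        d.tv_.tv_sec  = tv_.tv_sec  - rhs.tv_.tv_sec;
        d.tv_.tv_usec = tv_.tv_usec - rhs.tv_.tv_usec;
        if (d.tv_.tv_usec < 0) {       // borrow, as in timersub()
            d.tv_.tv_sec  -= 1;
            d.tv_.tv_usec += 1000000;
        }
        return d;
    }

    long seconds()      const { return (long)tv_.tv_sec; }
    long microseconds() const { return (long)tv_.tv_usec; }

private:
    TimeVal() { tv_.tv_sec = 0; tv_.tv_usec = 0; }
    struct timeval tv_;
};
For example, (TimeVal::Now() - start).microseconds() gives the sub-second part of the elapsed time without ever touching floating point.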