clock_t overflow on a 32-bit machine - C++

For statistical purposes I want to accumulate the whole CPU time used by a function of a program, in microseconds. It must work on two systems, one where sizeof(clock_t) = 8 (RedHat) and another where sizeof(clock_t) = 4 (AIX). On both machines clock_t is a signed integer type and CLOCKS_PER_SEC = 1000000 (i.e. one tick per microsecond, but I don't make that assumption in code and use the macro instead).
What I have is equivalent to something like this (but encapsulated in some fancy classes):
typedef unsigned long long u64;

u64 accum_ticks = 0;

void f()
{
    clock_t beg = clock();
    work();
    clock_t end = clock();
    accum_ticks += (u64)(end - beg); // (1)
}

u64 elapsed_CPU_us()
{
    return accum_ticks * 1e+6 / CLOCKS_PER_SEC;
}
But, on the 32-bit AIX machine where clock_t is an int, it will overflow after 35m47s. Suppose that in some call beg equals 35m43s since the program started, and work() takes 10 CPU-seconds, causing end to overflow. Can I trust line (1) for this and subsequent calls to f() from then on? f() is guaranteed to never take more than 35 minutes of execution, of course.
In case I can't trust line (1) at all, even on my particular machine, what alternatives do I have that don't involve importing any third-party library? (I can't copy libraries onto the system, and I can't use <chrono> because it isn't available on our AIX machines.)
NOTE: I can use kernel headers and the precision I need is in microseconds.

An alternate suggestion: Don't use clock. It's so underspecified it's nigh impossible to write code that will work fully portably, handling possible wraparound for a 32-bit integer clock_t, integer vs. floating point clock_t, etc. (and by the time you write it, you've written so much ugliness you've lost whatever simplicity clock provided).
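For illustration only, here is roughly what wraparound-safe accumulation for a 32-bit signed clock_t might look like, assuming two's-complement representation and that a single measured interval stays under the full 32-bit wrap (about 71 minutes at CLOCKS_PER_SEC = 1000000):

#include <time.h>
#include <stdint.h>

typedef unsigned long long u64;

/* Sketch only: assumes clock_t is a 32-bit signed integer (as on the AIX box)
 * and that a single measured interval is shorter than the full 32-bit wrap. */
static u64 clock_diff_u32(clock_t beg, clock_t end)
{
    /* Reinterpret both values as unsigned 32-bit and subtract modulo 2^32,
     * so a wrapped 'end' still yields the correct tick count without
     * relying on signed-overflow behaviour. */
    uint32_t b = (uint32_t)beg;
    uint32_t e = (uint32_t)end;
    return (u64)(uint32_t)(e - b);
}

Even then you're relying on an assumption about interval length that nothing enforces, which is exactly the kind of fragility the suggestion below avoids.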
Instead, use getrusage. It's not perfect, and it might do a little more than you strictly need, but:
- The times it returns are guaranteed to operate relative to 0 (where the value returned by clock at the beginning of a program could be anything)
- It lets you specify if you want to include stats from child processes you've waited on (clock either does or doesn't, in a non-portable fashion)
- It separates the user and system CPU times; you can use either one, or both, your choice
- Each time is expressed explicitly in terms of a pair of values, a time_t number of seconds and a suseconds_t number of additional microseconds. Since it doesn't try to encode a total microsecond count into a single time_t/clock_t (which might be 32 bits), wraparound can't occur until you've hit at least 68 years of CPU time (if you manage that on a system with 32-bit time_t, I want to know your IT folks; the only way I can imagine hitting that is on a system with hundreds of cores running for weeks, and any such system would be 64-bit at this point)
- The parts of the result you need are specified by POSIX, so it's portable to just about everywhere but Windows (where you're stuck writing preprocessor-controlled code to switch to GetProcessTimes when compiled for Windows)
Conveniently, since you're on POSIX systems (I think?), clock is already expressed as microseconds, not real ticks (POSIX specifies that CLOCKS_PER_SEC equals 1000000), so the values already align. You can just rewrite your function as:
#include <sys/time.h>
#include <sys/resource.h>

static inline u64 elapsed(const struct timeval *beg, const struct timeval *end)
{
    return (end->tv_sec - beg->tv_sec) * 1000000ULL + (end->tv_usec - beg->tv_usec);
}

void f()
{
    struct rusage beg, end;
    // Not checking return codes, because the only two documented failure cases are passing
    // an unmapped memory address for the struct, or an invalid who flag, neither of which
    // we're doing, easily verified by inspection
    getrusage(RUSAGE_SELF, &beg);
    work();
    getrusage(RUSAGE_SELF, &end);
    accum_ticks += elapsed(&beg.ru_utime, &end.ru_utime);
    // And if you want to include system time as well, add:
    accum_ticks += elapsed(&beg.ru_stime, &end.ru_stime);
}

u64 elapsed_CPU_us()
{
    return accum_ticks; // It's already stored natively in microseconds
}
On Linux 2.6.26+, you can replace RUSAGE_SELF with RUSAGE_THREAD to limit the stats to the resources used by the calling thread alone, not the whole calling process (which might help if other threads are doing unrelated work and you don't want their stats polluting yours), in exchange for less portability.
Yes, it's a little more work to compute the time (two additions/subtractions and one multiplication by a constant, doubled if you want both user and system time, where clock in the simplest usage is a single subtraction), but:
- Handling clock wraparound adds more work (and a branch, which this code doesn't have; admittedly, it's a fairly predictable branch), narrowing the gap
- Integer multiplication is roughly as cheap as addition and subtraction on modern chips (recent x86-64 chips can issue an integer multiply every clock cycle), so you're not adding orders of magnitude more work, and in exchange you get more control, more guarantees, and greater portability
Note: You might see code using clock_gettime with clock ID CLOCK_PROCESS_CPUTIME_ID, which would simplify your code when you just want total CPU time, not split up by user vs. system, without all the other stuff getrusage provides (perhaps it would be faster, simply by virtue of retrieving less data). Unfortunately, while clock_gettime is guaranteed by POSIX, the CLOCK_PROCESS_CPUTIME_ID clock ID is not, so you can't use it reliably on all POSIX systems (FreeBSD at least seems to lack it). All the parts of getrusage we're relying on are fully standard, so it's safe.
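For reference, a minimal sketch of that clock_gettime alternative on systems that do provide CLOCK_PROCESS_CPUTIME_ID (an illustration of the variant mentioned above, not the recommended getrusage approach):

#include <time.h>

typedef unsigned long long u64;

static u64 process_cpu_us(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return 0;  // clock ID not supported on this system
    return (u64)ts.tv_sec * 1000000ULL + (u64)ts.tv_nsec / 1000ULL;
}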

Related

Measuring CPU clock speed

I am trying to measure the speed of the CPU. I am not sure how accurate my method is. Basically, I tried an empty for loop with values like UINT_MAX, but the program terminated quickly, so I tried UINT_MAX * 3 and so on...
Then I realized that the compiler is optimizing away the loop, so I added a volatile variable to prevent the optimization. The following program takes approximately 1.5 seconds to finish. I want to know how accurate this algorithm is for measuring the clock speed. Also, how do I know how many cores are involved in the process?
#include <iostream>
#include <limits.h>
#include <stdint.h> // for UINT32_MAX
#include <time.h>

using namespace std;

int main(void)
{
    volatile int v_obj = 0;
    unsigned long A, B = 0, C = UINT32_MAX;
    clock_t t1, t2;

    t1 = clock();
    for (A = 0; A < C; A++) {
        (void)v_obj;
    }
    t2 = clock();

    std::cout << (double)(t2 - t1) / CLOCKS_PER_SEC << std::endl;
    double t = (double)(t2 - t1) / CLOCKS_PER_SEC;
    unsigned long clock_speed = (unsigned long)(C / t);
    std::cout << "Clock speed : " << clock_speed << std::endl;
    return 0;
}
This doesn't measure clock speed at all; it measures how many loop iterations can be done per second. There's no rule that says one iteration will run per clock cycle. It may be the case, and you may have actually found it to be the case: certainly with optimized code and a reasonable CPU, a useless loop shouldn't run much slower than that. It could run at half speed, though; some processors are not able to retire more than one taken branch every two cycles. And on esoteric targets, all bets are off.
So no, this doesn't measure clock cycles, except accidentally. In general it's extremely hard to get an empirical clock speed (you can ask your OS what it thinks the maximum clock speed and current clock speed are, see below), because
If you measure how much wall clock time a loop takes, you must know (at least approximately) the number of cycles per iteration. That's a bad enough problem in assembly, requiring fairly detailed knowledge of the expected microarchitectures (maybe a long chain of dependent instructions that each could only reasonably take 1 cycle, like add eax, 1? a long enough chain that differences in the test/branch throughput become small enough to ignore), so obviously anything you do there is not portable and will have assumptions built into it that may become false (actually there is another answer on SO that does this and assumes that addps has a latency of 3, which it doesn't anymore on Skylake, and didn't have on old AMDs). In C? Give up now. The compiler might be rolling some random code generator, and relying on it to be reasonable is like doing the same with a bear. Guessing the number of cycles per iteration of code you neither control nor even know is just folly. If it's just on your own machine you can check the code, but then you could just check the clock speed manually too, so...
If you measure the number of clock cycles elapsed in a given amount of wall clock time.. but this is tricky. Because rdtsc doesn't measure clock cycles (not anymore), and nothing else gets any closer. You can measure something, but with frequency scaling and turbo, it generally won't be actual clock cycles. You can get actual clock cycles from a performance counter, but you can't do that from user mode. Obviously any way you try to do this is not portable, because you can't portably ask for the number of elapsed clock cycles.
So if you're doing this for actual information and not just to mess around, you should probably just ask the OS. For Windows, query WMI for CurrentClockSpeed or MaxClockSpeed, whichever one you want. On Linux there's stuff in /proc/cpuinfo. Still not portable, but then, no solution is.
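As a rough illustration (Linux-specific, and the exact formatting of /proc/cpuinfo is not guaranteed by anything), reading the current clock speed might look like this:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.compare(0, 7, "cpu MHz") == 0) { // one entry per logical CPU
            std::cout << line << '\n';            // e.g. "cpu MHz : 2400.000"
            break;                                // the first one is enough here
        }
    }
}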
As for
how do I know how many cores are involved in the process?
One. Your thread may migrate between cores, of course, but since you only have one thread, it's on only one core at any time.
A good optimizer may remove the loop, since
for (A = 0; A < C; A++) {
    (void)v_obj;
}
has the same effect on the program state as:
A = C;
So the optimizer is entirely free to remove your loop.
So you cannot measure CPU speed this way, as it depends on the compiler as much as it does on the computer (not to mention the variable clock speed and multicore architecture already mentioned).

C++, Timer, Milliseconds

#include <iostream>
#include <conio.h>
#include <ctime>

using namespace std;

double diffclock(clock_t clock1, clock_t clock2)
{
    double diffticks = clock1 - clock2;
    double diffms = (diffticks) / (CLOCKS_PER_SEC / 1000);
    return diffms;
}

int main()
{
    clock_t start = clock();
    for (int i = 0;; i++)
    {
        if (i == 10000) break;
    }
    clock_t end = clock();
    cout << diffclock(start, end) << endl;
    getch();
    return 0;
}
So my problem comes down to it returning 0; to be straight, I want to check how much time my program takes to run...
I found tons of crap over the internet; mostly it comes down to the same point of getting a 0 because the start and the end are the same.
This problem is in C++, remember :<
There are a few problems here. The first is that you obviously switched the start and stop times when passing them to the diffclock() function. The second problem is optimization. Any reasonably smart compiler with optimizations enabled would simply throw the entire loop away, as it does not have any side effects. But even if you fix the above problems, the program would most likely still print 0. If you try to imagine doing billions of operations per second, and throw in sophisticated out-of-order execution, prediction and tons of other technologies employed by modern CPUs, even the CPU itself may effectively optimize your loop away. But even if it doesn't, you'd need a lot more than 10K iterations to make it run longer. You'd probably need your program to run for a second or two in order for clock() to reflect anything.
But the most important problem is clock() itself. That function is not suitable for any kind of performance measurement whatsoever. What it does is give you an approximation of the processor time used by the program. Aside from the vague nature of the approximation method that might be used by any given implementation (since the standard doesn't require anything specific), the POSIX standard also requires CLOCKS_PER_SEC to be equal to 1000000 independent of the actual resolution. In other words: it doesn't matter how precise the clock is, and it doesn't matter at what frequency your CPU is running. To put it simply, it is a totally useless number and therefore a totally useless function. The only reason why it still exists is probably for historical reasons. So, please do not use it.
To achieve what you are looking for, people used to read the CPU time stamp counter, also known as "RDTSC" after the corresponding CPU instruction used to read it. These days, however, this is also mostly useless because:
- Modern operating systems can easily migrate the program from one CPU to another. You can imagine that reading the time stamp on one CPU after running for a second on another doesn't make a lot of sense. It is only in the latest Intel CPUs that the counter is synchronized across CPU cores. All in all, it is still possible to do this, but a lot of extra care must be taken (i.e. one can set up the affinity for the process, etc.).
- Measuring CPU instructions of the program oftentimes does not give an accurate picture of how much time it is actually using. This is because in real programs there could be system calls where the work is performed by the OS kernel on behalf of the process. In that case, that time is not included.
- It could also happen that the OS suspends execution of the process for a long time. And even though it took only a few instructions to execute, to the user it seemed like a second. So such a performance measurement may be useless.
So what to do?
When it comes to profiling, a tool like perf must be used. It can track the number of CPU clock cycles, cache misses, branches taken, branches missed, the number of times the process was moved from one CPU to another, and so on. It can be used as a standalone tool, or the same counters can be read from within your application (with something like PAPI).
And if the question is about actual time spent, people use a wall clock. Preferably a high-precision one that is also not subject to NTP adjustments (monotonic). That shows exactly how much time elapsed, no matter what was going on. For that purpose clock_gettime() can be used. It is part of SUSv2 and POSIX.1-2001. Given that you use getch() to keep the terminal open, I'd assume you are using Windows. There, unfortunately, you don't have clock_gettime(), and the closest thing would be the performance counters API:
BOOL QueryPerformanceFrequency(LARGE_INTEGER *lpFrequency);
BOOL QueryPerformanceCounter(LARGE_INTEGER *lpPerformanceCount);
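For instance, a rough sketch of how those two calls are typically combined (error handling omitted; this is an illustration, not code from the answer):

#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // counts per second
    QueryPerformanceCounter(&start);

    // ... code to be timed ...

    QueryPerformanceCounter(&stop);
    long long us = (stop.QuadPart - start.QuadPart) * 1000000LL / freq.QuadPart;
    std::cout << "Elapsed: " << us << " microseconds" << std::endl;
}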
For a portable solution, the best bet is std::chrono::high_resolution_clock. It was introduced in C++11 and is supported by most industrial-grade compilers (GCC, Clang, MSVC).
Below is an example of how to use it. Please note that since I know my CPU will do 10000 increments of an integer way faster than a millisecond, I have changed the output to microseconds. I've also declared the counter as volatile in the hope that the compiler won't optimize it away.
#include <ctime>
#include <chrono>
#include <iostream>

int main()
{
    volatile int i = 0; // "volatile" is to ask the compiler not to optimize the loop away.
    auto start = std::chrono::steady_clock::now();
    while (i < 10000) {
        ++i;
    }
    auto end = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "It took me " << elapsed.count() << " microseconds." << std::endl;
}
When I compile and run it, it prints:
$ g++ -std=c++11 -Wall -o test ./test.cpp && ./test
It took me 23 microseconds.
Hope it helps. Good Luck!
At a glance, it seems like you are subtracting the larger value from the smaller value. You call:
diffclock( start, end );
But then diffclock is defined as:
double diffclock(clock_t clock1, clock_t clock2) {
    double diffticks = clock1 - clock2;
    double diffms = diffticks / (CLOCKS_PER_SEC / 1000);
    return diffms;
}
Apart from that, it may have something to do with the way you are converting units. The use of 1000 to convert to milliseconds is different on this page:
http://en.cppreference.com/w/cpp/chrono/c/clock
The problem appears to be that the loop is just too short. I tried it on my system and it gave 0 ticks. I checked what diffticks was and it was 0. Increasing the loop size to 100000000 gave a noticeable time lag, and I got -290 as output (a bug: I think diffticks should be clock2 - clock1, so we should get 290 and not -290). I also tried changing "1000" to "1000.0" in the division, and that didn't help.
Compiling with optimization removes the loop, so you either have to avoid optimization or make the loop "do something", e.g. increment a counter other than the loop counter in the loop body. At least that's what GCC does.
Note: this is available since C++11.
You can use std::chrono library.
std::chrono has two distinct kinds of objects: time points and durations. A time point represents a point in time, and a duration, as the term suggests, represents an interval or span of time.
This C++ library allows us to subtract two time points to get the duration of time passed in the interval. So you can set a starting point and a stopping point. Using functions, you can also convert them into appropriate units.
Example using high_resolution_clock (which is one of the three clocks this library provides):
#include <chrono>
using namespace std::chrono;
//before running function
auto start = high_resolution_clock::now();
//after calling function
auto stop = high_resolution_clock::now();
Subtract the stop and start time points and cast the result into the required units using the duration_cast() function. Predefined units are nanoseconds, microseconds, milliseconds, seconds, minutes, and hours.
auto duration = duration_cast<microseconds>(stop - start);
cout << duration.count() << endl;
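Put together, a self-contained version of those fragments might look like this (function_to_time is just a hypothetical placeholder for whatever you want to measure):

#include <chrono>
#include <iostream>

// Hypothetical stand-in for the function being timed.
static void function_to_time()
{
    volatile long sink = 0;
    for (long i = 0; i < 1000000; ++i)
        sink = sink + i;
}

int main()
{
    using namespace std::chrono;

    auto start = high_resolution_clock::now();
    function_to_time();
    auto stop = high_resolution_clock::now();

    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " microseconds" << std::endl;
}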
First of all, you should subtract end - start, not vice versa.
The documentation says clock() returns -1 if the value is not available; did you check that?
What optimization level do you use when compiling your program? If optimization is enabled, the compiler can effectively eliminate your loop entirely.

extending the std::chrono functionality to deal with run-time (non compile-time) constant periods

I have been experimenting with all kinds of timers on Linux and OSX, and would like to try to wrap some of them with the same interface used by std::chrono.
That's easy to do for timers that have a well-defined period at compile time, e.g. the POSIX clock_gettime() family, the clock_get_time() family on OSX, or gettimeofday().
However, there are some useful timers for which the "period" - while constant - is only known at runtime.
For example:
- POSIX states that the period of clock(), CLOCKS_PER_SEC, may be a variable on non-XSI systems
- on Linux, the period of times() is given at runtime by sysconf(_SC_CLK_TCK)
- on OSX, the period of mach_absolute_time() is given at runtime by mach_timebase_info()
- on recent Intel processors, the TSC register ticks at a constant rate, but of course that rate can only be determined at runtime
To wrap these timers in the std::chrono interface, one possibility would be to use a period of std::chrono::nanoseconds and convert the value of each timer to nanoseconds. Another approach could be to use a floating point representation. However, both approaches would introduce a (very small) overhead in the now() function, and a (probably small) loss in precision.
The solution I'm trying to pursue is to define a set of classes to represent such "run-time constant" periods, built along the same lines as the std::ratio class.
However I expect that will require rewriting all the related template classes and functions (as they assume constexpr values).
How do I wrap these kinds of timers a la std::chrono?
Or use non-constexpr values for the time period of a clock?
Does anyone have any experience with wrapping these kinds of timers a la std::chrono?
Actually I do. And on OSX, one of your platforms of interest. :-)
You mention:
on OSX, the period of mach_absolute_time() is given at runtime by mach_timebase_info()
Absolutely correct. Also on OSX, the libc++ implementation of high_resolution_clock and steady_clock is actually based on mach_absolute_time. I'm the author of this code, which is open source with a generous license (do anything you want with it as long as you retain the copyright).
Here is the source for libc++'s steady_clock::now(). It is built pretty much the way you surmised. The run time period is converted to nanoseconds prior to returning. On OS X the conversion factor is very often 1, and the code takes advantage of that fact with an optimization. However the code is general enough to handle non-1 conversion factors.
On the first call to now() there's a small cost of querying the run time conversion factor to nanoseconds. In the general case a floating point conversion factor is computed. In the common case (conversion factor == 1) the subsequent cost is calling through a function pointer. I've found that the overhead is really quite reasonable.
On OS X the conversion factor, although not determined until run time, is still a constant (i.e. does not vary as the program executes), so it only needs to be computed once.
If you're in a situation where your period is actually varying dynamically, you'll need more infrastructure to handle this. Essentially you would need to integrate (calculus) the period vs time curve and then compute an average period between two points in time. That would require a constant monitoring of the period as it changes with time, and <chrono> isn't the right tool for that. Such tools are typically handled at the OS level.
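For a rough idea of the shape of such a clock, here is a simplified sketch (this is not the libc++ source, and it omits the function-pointer optimization described above):

#include <mach/mach_time.h>
#include <chrono>
#include <cstdint>

struct mach_steady_clock
{
    using duration   = std::chrono::nanoseconds;
    using rep        = duration::rep;
    using period     = duration::period;
    using time_point = std::chrono::time_point<mach_steady_clock>;
    static const bool is_steady = true;

    static time_point now()
    {
        // The timebase is a run-time constant, so query it only once.
        static const mach_timebase_info_data_t tb = [] {
            mach_timebase_info_data_t info;
            mach_timebase_info(&info);
            return info;
        }();
        std::uint64_t ticks = mach_absolute_time();
        // Note: ticks * numer could overflow for very large tick counts when
        // numer != 1; the real implementation is more careful than this sketch.
        return time_point(duration(ticks * tb.numer / tb.denom));
    }
};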
[Does anyone have any experience with] using non-constexpr values for the time period of a clock?
After reading through the standard (20.11.5, Class template duration), "period" is expected to be "a specialization of ratio":
Remarks: If Period is not a specialization of ratio, the program is ill-formed.
and all chrono templates rely heavily on constexpr functionality.
Does anyone have any experience with wrapping these kinds of timers a la std::chrono?
I've found here a suggestion to use a duration with period = 1 and boost::rational as rep, though without any concrete examples.
I have done a similar thing for my purposes, only for Linux though. You find the code here; feel free to use the code in whatever way you want.
The challenges my implementation addresses overlap partially with the ones mentioned in your question. Specifically:
- The tick factor (required to convert from clock ticks to a time unit based on seconds) is retrieved at run time, but only the first time now() is used‡. If you are concerned about the small overhead this causes, you may call the now() function once at start-up, before you measure any actual intervals. The tick factor is stored in a static variable, which means there is still some overhead as, on the lowest level, each call of the now() function implies checking whether the static variable has been initialized. However, this overhead will be the same in each call of now(), so it shouldn't impact measuring time intervals.
- I do not convert to nanoseconds by default, because when measuring relatively long periods of time (e.g. a few seconds) this causes overflows very quickly. This is in fact the main reason why I don't use the boost implementation. Instead of converting to nanoseconds, I implement the base unit as a template parameter (called Precision in the code). I use std::ratio from C++11 as template arguments. So I can choose, for example, a clock<micro>, which implies that calling the now() function will internally convert to microseconds rather than nanoseconds, which means I can measure periods of many seconds or minutes without overflows and still with good precision. (This is independent of the unit used to produce output. You can have a clock<micro> and display the result in seconds, etc.)
- My clock type, which is called combined_clock, combines user time, system time and wall-clock time. There is a boost clock type for this, too, but it's not compatible with the ratio types and units from std, whereas mine is.
‡The tick factor is retrieved using the ::sysconf() call you suggest, and that is guaranteed to return one and the same value throughout the lifetime of the process.
So the way you use it is as follows:
#include "util/proctime.hpp"
#include <ratio>
#include <chrono>
#include <thread>
#include <utility>
#include <iostream>
int main()
{
using std::chrono::duration_cast;
using millisec = std::chrono::milliseconds;
using clock_type = rlxutil::combined_clock<std::micro>;
auto tp1 = clock_type::now();
/* Perform some random calculations. */
unsigned long step1 = 1;
unsigned long step2 = 1;
for (int i = 0 ; i < 50000000 ; ++i) {
unsigned long step3 = step1 + step2;
std::swap(step1,step2);
std::swap(step2,step3);
}
/* Sleep for a while (this adds to real time, but not CPU time). */
std::this_thread::sleep_for(millisec(1000));
auto tp2 = clock_type::now();
std::cout << "Elapsed time: "
<< duration_cast<millisec>(tp2 - tp1)
<< std::endl;
return 0;
}
The usage above involves a pretty-print function that generates output like this:
Elapsed time: [user 40, system 0, real 1070 millisec]
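For comparison, here is a bare-bones sketch of how the underlying tick factor and per-process times can be read directly on Linux with times() and sysconf(_SC_CLK_TCK); this is only an illustration, not the combined_clock code itself:

#include <sys/times.h>
#include <unistd.h>
#include <iostream>

int main()
{
    // The tick factor is a run-time constant, so it only needs to be queried once.
    const long ticks_per_sec = sysconf(_SC_CLK_TCK);

    struct tms usage;
    clock_t real = times(&usage);   // wall-clock ticks since an arbitrary point
    if (real == (clock_t)-1)
        return 1;

    std::cout << "user ticks:   " << usage.tms_utime << '\n'
              << "system ticks: " << usage.tms_stime << '\n'
              << "ticks/second: " << ticks_per_sec << std::endl;
}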

C++ fine granular time

The following piece of code gives 0 as runtime of the function. Can anybody point out the error?
struct timeval start,end;
long seconds,useconds;
gettimeofday(&start, NULL);
int optimalpfs=optimal(n,ref,count);
gettimeofday(&end, NULL);
seconds = end.tv_sec - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
cout<<"\nOptimal Runtime is "<<opt_runtime<<"\n";
I get both start and end times as the same. I get the following output:
Optimal Runtime is 0
Tell me the error please.
POSIX 1003.1b-1993 specifies interfaces for clock_gettime() (and clock_getres()), and offers that with the MON option there can be a type of clock with a clockid_t value of CLOCK_MONOTONIC (so that your timer isn't affected by system time adjustments). If available on your system then these functions return a structure which has potential resolution down to one nanosecond, though the latter function will tell you exactly what resolution the clock has.
struct timespec {
    time_t tv_sec;  /* seconds */
    long   tv_nsec; /* and nanoseconds */
};
You may still need to run your test function in a loop many times for the clock to register any time elapsed beyond its resolution, and perhaps you'll want to run your loop enough times to last at least an order of magnitude more time than the clock's resolution.
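For example, a minimal sketch of that approach (assuming CLOCK_MONOTONIC is available; on older glibc you may need to link with -lrt):

#include <time.h>
#include <iostream>

int main()
{
    struct timespec res, t0, t1;

    clock_getres(CLOCK_MONOTONIC, &res);
    std::cout << "clock resolution: " << res.tv_nsec << " ns" << std::endl;

    const long iterations = 1000000;   // enough to dwarf the clock's resolution
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iterations; ++i) {
        // ... call the code under test here (an empty loop like this
        //     would be optimized away, as discussed elsewhere) ...
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    std::cout << "per iteration: " << ns / iterations << " ns" << std::endl;
    return 0;
}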
Note though that apparently the Linux folks mis-read the POSIX.1b specifications and/or didn't understand the definition of a monotonically increasing time clock, and their CLOCK_MONOTONIC clock is affected by system time adjustments, so you have to use their invented non-standard CLOCK_MONOTONIC_RAW clock to get a real monotonic time clock.
Alternately one could use the related POSIX.1 timer_settime() call to set a timer running, a signal handler to catch the signal delivered by the timer, and timer_getoverrun() to find out how much time elapsed between the queuing of the signal and its final delivery, and then set your loop to run until the timer goes off, counting the number of iterations in the time interval that was set, plus the overrun.
Of course on a preemptive multi-tasking system these clocks and timers will run even while your process is not running, so they are not really very useful for benchmarking.
Slightly more rare is the optional POSIX.1-1999 clockid_t value of CLOCK_PROCESS_CPUTIME_ID, indicated by the presence of the _POSIX_CPUTIME macro in <time.h>, which represents the CPU-time clock of the calling process, giving values representing the amount of execution time of the invoking process. (Even more rare is the TCT option's clockid_t of CLOCK_THREAD_CPUTIME_ID, indicated by the _POSIX_THREAD_CPUTIME macro, which represents the per-thread CPU-time clock, giving values representing the amount of execution time of the invoking thread.)
Unfortunately POSIX makes no mention of whether these so-called CPUTIME clocks count just user time, or both user and system (and interrupt) time, accumulated by the process or thread, so if your code under profiling makes any system calls then the amount of time spent in kernel mode may, or may not, be represented.
Even worse, on multi-processor systems, the values of the CPUTIME clocks may be completely bogus if your process happens to migrate from one CPU to another during its execution. The timers implementing these CPUTIME clocks may also run at different speeds on different CPU cores, and at different times, further complicating what they mean. I.e. they may not mean anything related to real wall-clock time, but only be an indication of the number of CPU cycles (which may still be useful for benchmarking so long as relative times are always used and the user is aware that execution time may vary depending on external factors). Even worse it has been reported that on Linux CPU TimeStampCounter-based CPUTIME clocks may even report the time that a process has slept.
If your system has a good working getrusage() system call then it will hopefully be able to give you a struct timeval for each of the actual user and system times separately consumed by your process while it was running. However, since this puts you back to a microsecond clock at best, you'll need to run your test code enough times repeatedly to get a more accurate timing, calling getrusage() once before the loop and again afterwards, and then calculating the differences between the times given. For simple algorithms this might mean running them millions of times, or more. Note also that on many systems the division between user time and system time is done somewhat arbitrarily, and if examined separately in a repeated loop one or the other can even appear to run backwards. However, if your algorithm makes no system calls then summing the time deltas should still be a fair total time for your code execution.
BTW, take care when comparing time values so that you don't overflow or end up with a negative value in a field, either as @Nim suggests, or perhaps like this (from NetBSD's <sys/time.h>):
#define timersub(tvp, uvp, vvp)                                 \
    do {                                                        \
        (vvp)->tv_sec = (tvp)->tv_sec - (uvp)->tv_sec;          \
        (vvp)->tv_usec = (tvp)->tv_usec - (uvp)->tv_usec;       \
        if ((vvp)->tv_usec < 0) {                               \
            (vvp)->tv_sec--;                                    \
            (vvp)->tv_usec += 1000000;                          \
        }                                                       \
    } while (0)
(you might even want to be more paranoid that tv_usec is in range)
One more important note about benchmarking: make sure your function is actually being called, ideally by examining the assembly output from your compiler. Compiling your function in a separate source module from the driver loop usually convinces the optimizer to keep the call. Another trick is to have it return a value that you assign inside the loop to a variable defined as volatile.
You've got a weird mix of floats and ints here:
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
Try using:
long opt_runtime = (long)(seconds * 1000 + (float)useconds/1000);
This way you'll get your results in milliseconds.
The execution time of optimal(...) is less than the granularity of gettimeofday(...). This is likely to happen on Windows, where the typical granularity is up to 20 ms. I've answered a related gettimeofday(...) question here.
For Linux I asked How is the microsecond time of linux gettimeofday() obtained and what is its accuracy? and got a good result.
More information on how to obtain accurate timing is described in this SO answer.
I normally do such a calculation as:
long long ss = start.tv_sec * 1000000LL + start.tv_usec;
long long es = end.tv_sec * 1000000LL + end.tv_usec;
Then do a difference
long long microsec_diff = es - ss;
Now convert as required:
double seconds = microsec_diff / 1000000.;
Normally I don't bother with the last step; I do all timings in microseconds.
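Wrapped into small helpers, that same arithmetic might look like this (just a sketch):

#include <sys/time.h>

// Convert a struct timeval to a single microsecond count.
static long long tv_to_us(const struct timeval &tv)
{
    return tv.tv_sec * 1000000LL + tv.tv_usec;
}

// Elapsed microseconds between two gettimeofday() samples.
static long long elapsed_us(const struct timeval &start, const struct timeval &end)
{
    return tv_to_us(end) - tv_to_us(start);
}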

QueryPerformanceCounter and overflows

I'm using QueryPerformanceCounter to do some timing in my application. However, after running it for a few days the application seems to stop functioning properly. If I simply restart the application it starts working again. This makes me believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
    timer()
    {
        QueryPerformanceFrequency(&freq_);
        QueryPerformanceCounter(&time_);
    }

    void tick(double interval)
    {
        LARGE_INTEGER t;
        QueryPerformanceCounter(&t);

        if (time_.QuadPart != 0)
        {
            int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
            int done = 0;
            do
            {
                QueryPerformanceCounter(&t);

                int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
                int ticks_left = ticks_to_wait - ticks_passed;

                if (t.QuadPart < time_.QuadPart)    // time wrap
                    done = 1;
                if (ticks_passed >= ticks_to_wait)
                    done = 1;

                if (!done)
                {
                    // if > 0.002s left, do Sleep(1), which will actually sleep some
                    // steady amount, probably 1-2 ms,
                    // and do so in a nice way (cpu meter drops; laptop battery spared).
                    // otherwise, do a few Sleep(0)'s, which just give up the timeslice,
                    // but don't really save cpu or battery, but do pass a tiny
                    // amount of time.
                    if (ticks_left > static_cast<int>((freq_.QuadPart * 2) / 1000))
                        Sleep(1);
                    else
                        for (int i = 0; i < 10; ++i)
                            Sleep(0); // causes thread to give up its timeslice
                }
            }
            while (!done);
        }

        time_ = t;
    }

private:
    LARGE_INTEGER freq_;
    LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically for weeks of running continuously.
And if not, where is the problem? I thought the overflow was handled by:
if (t.QuadPart < time_.QuadPart) // time wrap
    done = 1;
But maybe that's not enough?
EDIT: Please observe that I did not write the original code; Ryan M. Geiss did. The link to the original source of the code is in the code.
QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing, if you're prepared to handle abnormal results. It is not exact: it's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the CPU if another context wants it; otherwise it'll return immediately.
In short, timing on Windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without jumping through hoops, but this isn't the case. In our game framework we use several time sources and corrections from the server to ensure all connected clients have the same game time, and there are a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime and wrap it into something that adjusts for time jumps/wraparounds.
Also, you should convert your double interval to an int64 count of milliseconds and then use only integer math; this avoids problems due to floating point types' varying accuracy based on their contents.
Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call
Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.
Using a nanosecond-scale timer to control something like Sleep(), which at best is precise to several milliseconds (and usually several dozen milliseconds), is somewhat controversial anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns fewer CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could, for example, create a semaphore and never touch it in normal operation. The semaphore exists only so you can wait on something, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break out earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.
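A bare-bones sketch of that pattern (Win32, error handling omitted; the names are illustrative, not from the question's code):

#include <windows.h>

// Created once; only ever signaled when the waiter should wake up early.
static HANDLE g_wake = CreateSemaphoreW(NULL, 0, 1, NULL);

void wait_up_to(DWORD milliseconds)
{
    DWORD r = WaitForSingleObject(g_wake, milliseconds);
    if (r == WAIT_OBJECT_0) {
        // Woken early: someone called ReleaseSemaphore(g_wake, 1, NULL).
    } else if (r == WAIT_TIMEOUT) {
        // The full interval elapsed without being signaled.
    }
}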
The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason for that is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could wrap at any other time it felt like it, that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time delta, so that's safe. It shouldn't be necessary on modern systems, but trusting Microsoft is rarely a good idea.
...
However, the bigger problem with time wrapping is that ticks_to_wait, ticks_passed, and ticks_left are all int, not 64-bit types (LARGE_INTEGER or long long) like they should be. This makes most of that code wrap if any significant time periods are involved, and "significant" in this context is platform dependent; it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPC's return value being zero. The return value is not the 64-bit time passed by pointer; it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish: it appears to be tuned to behave correctly only at a particular level of contention and a particular per-thread CPU performance. If you need resolution, that's a horrible idea, and if you don't need resolution, then that entire function should have just been a single call to Sleep.