C++ fine granular time - c++

The following piece of code gives 0 as runtime of the function. Can anybody point out the error?
struct timeval start,end;
long seconds,useconds;
gettimeofday(&start, NULL);
int optimalpfs=optimal(n,ref,count);
gettimeofday(&end, NULL);
seconds = end.tv_sec - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
cout<<"\nOptimal Runtime is "<<opt_runtime<<"\n";
I get both start and end time as the same.I get the following output
Optimal Runtime is 0
Tell me the error please.

POSIX 1003.1b-1993 specifies interfaces for clock_gettime() (and clock_getres()), and offers that with the MON option there can be a type of clock with a clockid_t value of CLOCK_MONOTONIC (so that your timer isn't affected by system time adjustments). If available on your system then these functions return a structure which has potential resolution down to one nanosecond, though the latter function will tell you exactly what resolution the clock has.
struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* and nanoseconds */
};
You may still need to run your test function in a loop many times for the clock to register any time elapsed beyond its resolution, and perhaps you'll want to run your loop enough times to last at least an order of magnitude more time than the clock's resolution.
Note though that apparently the Linux folks mis-read the POSIX.1b specifications and/or didn't understand the definition of a monotonically increasing time clock, and their CLOCK_MONOTONIC clock is affected by system time adjustments, so you have to use their invented non-standard CLOCK_MONOTONIC_RAW clock to get a real monotonic time clock.
Alternately one could use the related POSIX.1 timer_settime() call to set a timer running, a signal handler to catch the signal delivered by the timer, and timer_getoverrun() to find out how much time elapsed between the queuing of the signal and its final delivery, and then set your loop to run until the timer goes off, counting the number of iterations in the time interval that was set, plus the overrun.
Of course on a preemptive multi-tasking system these clocks and timers will run even while your process is not running, so they are not really very useful for benchmarking.
Slightly more rare is the optional POSIX.1-1999 clockid_t value of CLOCK_PROCESS_CPUTIME_ID, indicated by the presence of the _POSIX_CPUTIME from <time.h>, which represents the CPU-time clock of the calling process, giving values representing the amount of execution time of the invoking process. (Even more rare is the TCT option of clockid_t of CLOCK_THREAD_CPUTIME_ID, indicated by the _POSIX_THREAD_CPUTIME macro, which represents the CPU time clock, giving values representing the amount of execution time of the invoking thread.)
Unfortunately POSIX makes no mention of whether these so-called CPUTIME clocks count just user time, or both user and system (and interrupt) time, accumulated by the process or thread, so if your code under profiling makes any system calls then the amount of time spent in kernel mode may, or may not, be represented.
Even worse, on multi-processor systems, the values of the CPUTIME clocks may be completely bogus if your process happens to migrate from one CPU to another during its execution. The timers implementing these CPUTIME clocks may also run at different speeds on different CPU cores, and at different times, further complicating what they mean. I.e. they may not mean anything related to real wall-clock time, but only be an indication of the number of CPU cycles (which may still be useful for benchmarking so long as relative times are always used and the user is aware that execution time may vary depending on external factors). Even worse it has been reported that on Linux CPU TimeStampCounter-based CPUTIME clocks may even report the time that a process has slept.
If your system has a good working getrusage() system call then it will hopefully be able to give you a struct timeval for each of the the actual user and system times separately consumed by your process while it was running. However since this puts you back to a microsecond clock at best then you'll need to run your test code enough times repeatedly to get a more accurate timing, calling getrusage() once before the loop, and again afterwards, and the calculating the differences between the times given. For simple algorithms this might mean running them millions of times, or more. Note also that on many systems the division between user time and system time is done somewhat arbitrarily and if examined separately in a repeated loop one or the other can even appear to run backwards. However if your algorithm makes no system calls then summing the time deltas should still be a fair total time for your code execution.
BTW, take care when comparing time values such that you don't overflow or end up with a negative value in a field, either as #Nim suggests, or perhaps like this (from NetBSD's <sys/time.h>):
#define timersub(tvp, uvp, vvp) \
do { \
(vvp)->tv_sec = (tvp)->tv_sec - (uvp)->tv_sec; \
(vvp)->tv_usec = (tvp)->tv_usec - (uvp)->tv_usec; \
if ((vvp)->tv_usec < 0) { \
(vvp)->tv_sec--; \
(vvp)->tv_usec += 1000000; \
} \
} while (0)
(you might even want to be more paranoid that tv_usec is in range)
One more important note about benchmarking: make sure your function is actually being called, ideally by examining the assembly output from your compiler. Compiling your function in a separate source module from the driver loop usually convinces the optimizer to keep the call. Another trick is to have it return a value that you assign inside the loop to a variable defined as volatile.

You've got weird mix of floats and ints here:
long opt_runtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
Try using:
long opt_runtime = (long)(seconds * 1000 + (float)useconds/1000);
This way you'll get your results in milliseconds.

The execution time of optimal(...) is less than the granularity of gettimeofday(...). This likely happes on Windows. On Windows the typical granularity is up to 20 ms. I've answered a related gettimeofday(...) question here.
For Linux I asked How is the microsecond time of linux gettimeofday() obtained and what is its accuracy? and got a good result.
More information on how to obtain accurate timing is described in this SO answer.

I normally do such a calculation as:
long long ss = start.tv_sec * 1000000LL + start.tv_usec;
long long es = end.tv_sec * 1000000LL + end.tv_usec;
Then do a difference
long long microsec_diff = es - ss;
Now convert as required:
double seconds = microsec_diff / 1000000.;
Normally, I don't bother with the last step, do all timings in microseconds.

Related

why clock() does not work on the the cluster machine

I want to get the running time of part of my code.
my C++ code is like:
...
time_t t1 = clock();
/*
Here is my core code.
*/
time_t t2 = clock();
cout <<"Running time: "<< (1000.0 * (t2 - t1)) / CLOCKS_PER_SEC << "ms" << endl;
...
This code works well on my laptop.(Opensuse,g++ and clang++, Core i5).
But it does not work well on the cluster in the department.
(Ubuntu, g++, amd Opteron and intel Xeon)
I always get some integer running time :
like : 0ms or 10ms or 20ms.
What cause that ? Why? Thanks!
Clocks are not guaranteed to be exact down to ~10-44 seconds (Planck time), they often have a minimal resolution. The Linux man page implies this with:
The clock() function returns an approximation of processor time used by the program.
and so does the ISO standard C11 7.27.2.1 The clock function /3:
The clock function returns the implementation’s best approximation of ...
and in 7.27.1 Components of time /4:
The range and precision of times representable in clock_t and time_t are implementation-defined.
From your (admittedly limited) sample data, it looks like the minimum resolution of your cluster machines is on the order of 10ms.
In any case, you have several possibilities if you need a finer resolution.
First, find a (probably implementation-specific) means of timing things more accurately.
Second, don't do it once. Do it a thousand times in a tight loop and then just divide the time taken by 1000. That should roughly increase your resolution a thousand-fold.
Thirdly, think about the implication that your code only takes 50ms at the outside. Unless you have a pressing need to execute it more than twenty times a second (assuming you have no other code to run), it may not be an issue.
On that last point, think of things like "What's the longest a user will have to wait before they get annoyed?". The answer to that would vary but half a second might be fine in most situations.
Since 50ms code could run ten times over during that time, you may want to ignore it. You'd be better off concentrating on code that has a clearly larger impact.

Why is the CPU time different with every execution of this program?

I have a hard time understanding processor time. The result of this program:
#include <iostream>
#include <chrono>
// the function f() does some time-consuming work
void f()
{
volatile long double d;
int size = 10000;
for(int n=0; n<size; ++n)
for(int m=0; m<size; ++m)
d = n*m;
}
int main()
{
std::clock_t start = std::clock();
f();
std::clock_t end = std::clock();
std::cout << "CPU time used: "
<< (end - start)
<< "\n";
}
Seems to randomly fluctuate between 210 000, 220 000 and 230 000. At first I was amazed, why these discrete values. Then I found out that std::clock() returns only approximate processor time. So probably the value returned by std::clock() is rounded to a multiple of 10 000. This would also explain why the maximum difference between the CPU times is 20 000 (10 000 == rounding error by the first call to std::clock() and 10 000 by the second).
But if I change to int size = 40000; in the body of f(), I get fluctuations in the ranges of 3 400 000 to 3 500 000 which cannot be explained by rounding.
From what I read about the clock rate, on Wikipedia:
The CPU requires a fixed number of clock ticks (or clock cycles) to
execute each instruction. The faster the clock, the more instructions
the CPU can execute per second.
That is, if the program is deterministic (which I hope mine is), the CPU time needed to finish should be:
Always the same
Slightly higher than the number of instructions carried out
My experiments show neither, since my program needs to carry out at least 3 * size * size instructions. Could you please explain what I am doing wrong?
First, the statement you quote from Wikipedia is simply false.
It might have been true 20 years ago (but not always, even
then), but it is totally false today. There are many things
which can affect your timings:
The first: if you're running on Windows, clock is broken,
and totally unreliable. It returns the difference in elapsed
time, not CPU time. And elapsed time depends on all sorts of
other things the processor might be doing.
Beyond that: things like cache misses have a very significant
impact on time. And whether a particular piece of data is in
the cache or not can depend on whether your program was
interrupted between the last access and this one.
In general, anything less than 10% can easily be due to the
caching issues. And I've seen differences of a factor of 10
under Windows, depending on whether there was a build running or
not.
You don't state what hardware you're running the binary on.
Does it have an interrupt driven CPU ?
Is it a multitasking operating system ?
You're mistaking the cycle time of the CPU (the CPU clock as Wikipedia refers to) with the time it takes to execute a particular piece of code from start to end and all the other stuff the poor CPU has to do at the same time.
Also ... is all your executing code in level 1 cache, or is some in level 2 or in main memory, or on disk ... what about the next time you run it ?
Your program is not deterministic, because it uses library and system functions which are not deterministic.
As a particular example, when you allocate memory this is virtual memory, which must be mapped to physical memory. Although this is a system call, running kernel code, it takes place on your thread and will count against your clock time. How long it takes to do this will depend on what the overall memory allocation situation is.
The CPU time is indeed "fixed" for a given set of circumstances. However, in a modern computer, there are other things happening in the system, which interferes with the execution of your code. It may be that caches are being wiped out when your email software wakes up to check if there is any new emails for you, or when the HP printer software checks for updates, or when the antivirus software decides to run for a little bit checking if your memory contains any viruses, etc, etc, etc, etc.
Part of this is also caused by the problem that CPU time accounting in any system is not 100% accurate - it works on "clock-ticks" and similar things, so the time used by for example an interrupt to service a network packet coming in, or the hard disk servicing interrupt, or the timer interrupt to say "another millisecond ticked by" these all account into "the currently running process". Assuming this is Windows, there is a further "feature", and that is that for historical and other reasons, std::clock() simply returns the time now, not actually the time used by your process. So for exampple:
t = clock();
cin >> x;
t = clock() - t;
would leave t with a time of 10 seconds if it took ten seconds to input the value of x, even though 9.999 of those ten seconds were spent in the idle process, not your program.

Linux/c++ timing method to gaurentee execution every N seconds despite drift?

I have a program that needs to execute every X seconds to write some output. The program will do some intermittent polling and processing between each output. So for example I may be outputing every 5 seconds, but I wake up to poll every .1 seconds until I hit the next 5 second mark. The program in theory will run for months between restart, possible even longer.
I need the the execution at every X seconds to stay consistent to wall clock. In other words I can't allow clock drift to cause me to drift away from the X second mark. I don't need quite the same level of accuracy in the intermitent polling, but I would like to poll more often then a second so I need a structur that can represent sub-second percision.
I realize that by the very nature of running on an OS there wil be a certain inconsistency/latency in execution of any timer such that I can't gaurentee that I'll execute at exactly ever x seconds. But that is fine so long at it stays a small normal distribution, what I don't want is for drift to allow the time to consistently get further and further off.
I would also prefer to try to minimize the CPU cost of the polling as much as possible, but that's a secondary concern.
What timing constructs are available for Linux that can best provide me this level of percision? I'm trying to avoid including boost in the application due to hassles in distribution, but can use it if I have to. So methods using the standard c++ libraries are prefered, but if Bosst can do it better I would like to know that as well.
Thank you.
-ps, I can't use c++11. It's not an option so there is no way I can use any constructs from it.
clock_gettime and clock_nanosleep both operate on sub-second times, and use of CLOCK_MONOTONIC should prevent any skew due to adjustments to system time (such as NTP or adjtimex). For example,
long delay_ns = 250000000;
struct timespec next;
clock_gettime(CLOCK_MONOTONIC, &next);
while (1) {
next.ts_nsec += delay_ns;
next.ts_sec += ts_nsec / 1000000000;
next.ts_nsec %= 1000000000;
clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
/* do something after the wait */
}
In reality, you should check whether you've returned early due to a signal and whether you should skip an interval because too much time passed while you were sleeping.
Other methods of sleeping, such as nanosleep and select, only allow for specifying a time interval and use CLOCK_REALTIME, which is system time and may change.
If you need some exact timing without jitter or clock drift, make sure to have an external time source like a GPS, atom clock or radio time clock.
See for example the NTP integration documented in this PDF.
All NTP tricks work as well on Linux as on NanoBSD (which is in the PDF).
I would record the time t_0 when I started, then at any given time t, if i = (t - t_0)/X (integer division) is greater than the same quantity last time it executed, you execute again.
For example, suppose you are running every 5 seconds, if you started at time 23 you record i = 0, t_0 = 23. Then at time 27, (27 - 23)/5 = 0, so you don't yet execute. But next time around when it comes to time 28, (28 - 23)/5 = 1, which is greater than 0, so you execute.
This way you're not adding to a counter each tick which could have a huge drift, you're computing absolutely the time since the start so you know the exact times at which to execute.

timers, threads and compiler misbehaviour

I'm having trouble with something and couldn't find any answers about it, as I don't even know what to search for. I have a done a timer class using QueryPerformanceCounter, from my application, I launch a second thread object that has its own instanced timer and I just have an infinite loop getting delta time from the timer and using it to output the number of loop iterations per second.
I've noticed that it was giving me weird values so I started printing delta time and found out it was coming as 0 sometimes, so I went inside the method that returns delta time and did some testing. This is my deltaTime() method:
double MyTimer2::deltaTime()
{
LARGE_INTEGER timenow;
QueryPerformanceCounter(&timenow);
//std::cout << "timenow=" << (double)timenow.QuadPart << " currentticks=" << (double)m_currentTicks.QuadPart << std::endl;
double m_deltaTime = (double)(timenow.QuadPart - m_currentTicks.QuadPart) /* 1000.0*/ / (double)m_frequency.QuadPart;
m_currentTicks = timenow;
if(m_deltaTime < 0.000001)
return 0.0;
return m_deltaTime;
}
So, I put a breakpoint on "return 0.0;" and what happens is that it gets there most of the time, which is not correct. However, if I uncomment the printing code and run, I will never stop on the breakpoint. So in theory, my printing code is making it work correctly, whereas if I remove it, things stop working as they should! How is this possible, why is it happening and how can I fix it? I've tried _ReadWriteBarrier() unsuccessfully.
Thanks in advance!
EDIT: I need a high-resolution timer for physics simulation!
A couple processor generations ago, QueryPerformanceCounter() would read the CPU's cycle counter (e.g. rdtsc). Using this method, the number of ticks from successive reads would never be zero. The resolution was equal to the CPU clock rate, e.g. 3 GHz.
Modern processors have two characteristics which make the cycle counter useless for timing. First, you have multiple cores, which each have their own cycle counter. Threads can migrate between cores, and if you read the cycle counter from two different cores, the difference would not be related to elapsed time. It could even be negative. Secondly, you have dynamic clocking based on load (both underclocking to save power and overclocking for performance). Intel calls these "SpeedStep" and "Turbo Boost", respectively. When the cycle rate isn't fixed, there's no way to convert from ticks to time.
So, QueryPerformanceCounter now uses a dedicated piece of hardware called a High-Performance Event Counter (HPET), with a resolution of several MHz. Importantly, there's only one regardless of how many cores you have, and it doesn't change speed dynamically. But, since the resolution is lower, it is now possible to read it twice between ticks, in which case you'll get an elapsed time reported as zero.
In practice, this isn't a problem. If you need timing more precise than what the HPET can provide, then a general purpose computer is not suitable for you. Timing in the nanosecond range will be severely affected by interrupts.
What could possibly be the purpose of this block?
if(m_deltaTime < 0.000001)
return 0.0;
It has no value, it simply screws with the results, telling you the time was zero when it actually wasn't.
First of all, your timer is wrong: it consumes your CPU intensively. On the single core machine it will slow down all the system. If you want to create a timer and target Windows, you can use timer functions.
Then, every not negative value, returned by your deltaTime() function is valid. While you hosted not in real-time operating system, every operation can take arbitrary amount of time. One iteration can take about tens cycles of processor ticks, or tens years. No one guarantee.
Third, about experimental results. It seems that if context will be switched once between two consecutive time measurement, you get value about 0.016s, if not, you get value bellow 0.000001s that is floored to 0s.
As it was said, printing to console is relatively heavy operation and you actually always get context switched when you enable it.
EDIT
While QueryPerformanceCounter seems to offer great resolution, it traps you. You will never get actually high resolution timer, unless you work in real-time OS.

QueryPerformanceCounter and overflows

I'm using QueryPerformanceCounter to do some timing in my application. However, after running it for a few days the application seems to stop functioning properly. If I simply restart the application it starts working again. This makes me a believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
timer()
{
QueryPerformanceFrequency(&freq_);
QueryPerformanceCounter(&time_);
}
void tick(double interval)
{
LARGE_INTEGER t;
QueryPerformanceCounter(&t);
if (time_.QuadPart != 0)
{
int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
int done = 0;
do
{
QueryPerformanceCounter(&t);
int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
int ticks_left = ticks_to_wait - ticks_passed;
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
if (ticks_passed >= ticks_to_wait)
done = 1;
if (!done)
{
// if > 0.002s left, do Sleep(1), which will actually sleep some
// steady amount, probably 1-2 ms,
// and do so in a nice way (cpu meter drops; laptop battery spared).
// otherwise, do a few Sleep(0)'s, which just give up the timeslice,
// but don't really save cpu or battery, but do pass a tiny
// amount of time.
if (ticks_left > static_cast<int>((freq_.QuadPart*2)/1000))
Sleep(1);
else
for (int i = 0; i < 10; ++i)
Sleep(0); // causes thread to give up its timeslice
}
}
while (!done);
}
time_ = t;
}
private:
LARGE_INTEGER freq_;
LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically for weeks of running continuously?
And if not where the problem is? I thought the overflow was handled by
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
But maybe thats not enough?
EDIT: Please observe that I did not write the original code, Ryan M. Geiss did, the link to the original source of the code is in the code.
QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing, if you're prepared to handle abnormal results. It is not exact - It's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the cpu if another context wants it - otherwise it'll return immediately.
In short, timing on windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without going through hoops - but this isn't the case. In our game framework we're using several time sources and corrections from the server to ensure all connected clients have the same game time, and there's a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime, wrap it into something that adjusts for time jumps/wrap arounds.
Also, you should convert your double interval to an int64 milliseconds and then use only integer math - this avoids problems due to floating point types' varying accuracy based on their contents.
Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call
Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.
Using a nanosecond-scale timer to control something like Sleep() that at best is precise to several milliseconds (and usually, several dozen milliseconds) is somewhat controversary anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns less CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could for example create a semapore and never touch it in normal operation. The semaphore exists only so you can wait on something, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And, it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break up earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.
The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason for that is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could wrap at any other time it felt like it, that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time deltas, so that's safe. It shouldn't be necessary on modern systems, but trusting microsoft is rarely a good idea.
...
However, the bigger problem there with time wrapping is in the fact that ticks_to_wait, ticks_passed, and ticks_left are all int, not LARGE_INT or long long like they should be. This makes most of that code wrap if any significant time periods are involved - and "significant" in this context is platform dependent, it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPCs return value being zero. The return value is not the 64 bit time passed by pointer, it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish - it appears to be tuned to behave correctly only on a particular level of contention and a particular per-thread CPU performance. If you need resolution that's a horrible idea, and if you don't need resolution then that entire function should have just been a single call to Sleep.