Getting CPU time in Ubuntu? - c++

I have a question on measuring CPU time on Ubuntu 12.04.
I want to use the CPU time as a stopping criterion in some loop. What is the best way?
while (....)
{
    // main part
    // get CPUTIME
    if (CPUTIME >= Given_Time)
    {
        break;
    }
}
1) I can use clock(), if I don't care too much about the resolution of time.
clock_t begin = clock();
....
clock_t end = clock();
CPUTIME = (double)(end - begin) / (double)CLOCKS_PER_SEC;
However, clock_t may overflow, since the running time may be long (more than an hour). How can I fix this problem?
2) The second option is using getrusage(int who, struct rusage *usage)
Does it cost too much to call this function in the loop?
3) The third option is to use int clock_gettime(clockid_t clk_id, struct timespec *tp).
So far, this is the best choice in my opinion.
Any suggestions and comments will be helpful.

Depending on how fast your loop is running, you might not want to call the checking function every cycle. Perhaps add a loop counter and check the CPU time only if, for example, (i % 1000) == 0.
To answer your question: I'd personally use clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...) because librt, the library in which this function resides, was made specifically for tasks like this. However, there's no reason that you can't use the other two; you can easily detect the overflow with clock() by detecting when the value wraps and having a separate counter that counts the number of times it wraps.
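For what it's worth, a minimal sketch of that combination (CLOCK_PROCESS_CPUTIME_ID plus a check only every N iterations); cpu_seconds is a hypothetical helper name, Given_Time is the limit from the question, and the 1000-iteration interval is an arbitrary choice. On older glibc you may need to link with -lrt:
#include <time.h>

// CPU time consumed by this process, in seconds.
static double cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

// Usage sketch:
//   double start = cpu_seconds();
//   for (long i = 0; /* work remains */; ++i) {
//       // main part
//       if (i % 1000 == 0 && cpu_seconds() - start >= Given_Time)
//           break;
//   }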

Related

Correct QueryPerformanceCounter Function implementation / Time changes everytime

I have to create a sorting algorithm function that returns the number of comparisons, the number of copies, and the number of MICROSECONDS it takes to finish its sorting.
I have seen that to measure microseconds I have to use QueryPerformanceCounter, as it is accurate (P.S. I know it isn't portable between OSes).
So I've done this:
void Exchange_sort(int vect[], int dim, int &countconf, int &countcopy, double &time)
{
    LARGE_INTEGER a, b, oh, freq;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&a);
    QueryPerformanceCounter(&b);
    oh.QuadPart = b.QuadPart - a.QuadPart; // Saves in oh the overhead of the counter call itself

    QueryPerformanceCounter(&a); // The sorting algorithm starts
    int i = 0, j = 0;
    for (i = 0; i < dim - 1; i++)
    {
        for (j = i + 1; j < dim; j++)
        {
            countconf++; // +1 comparison
            if (vect[i] > vect[j])
            {
                scambio(vect[i], vect[j]); // A function that swaps 2 integers
                countcopy = countcopy + 3; // +3 copies
            }
        }
    }
    QueryPerformanceCounter(&b); // Ends timer

    time = ((double)(b.QuadPart - a.QuadPart - oh.QuadPart) / freq.QuadPart) * 1000000;
}
The *1000000 is actually to give microseconds...
I think it should work like this, but every time I call the function with the same array dimension, it returns a different time... How can I solve that?
Thank you very much, and sorry for my bad coding
Firstly, the performance counter frequency might not be that great. It's usually several hundred thousand or more, which gives a microsecond or tens of microseconds resolution, but you should be aware that it can be even worse.
Secondly, if your array size is small, your sort might finish in nanoseconds or microseconds, and you would not be able to measure that accurately with QueryPerformanceCounter.
Thirdly, when your benchmark process is running, Windows might take the CPU away from it for a (relatively) long time, milliseconds or maybe even hundreds of milliseconds. This will lead to highly irregular and seemingly erratic timings.
I have two suggestions that you might pursue independently of each other:
I suggest you investigate the RDTSC instruction (via inline assembly, compiler intrinsics, or an existing library), which will most likely give you better resolution with far less overhead. But I have to warn you that it has its own bag of problems.
For this type of benchmark, you have to run your sort routine with the exact same input many times (tens or hundreds) and then take the smallest time measurement. The reason that you should adopt this strategy is that there are a few phenomena that will interfere with your timing and make it longer, but there is nothing that can make your sort go faster than it would on paper. Therefore, you need to run the test many many times and hope to all your gods that the fastest time you've measured is the actual running time with no interference or noise.
UPDATE: Reading through the comments on the question, it seems that you are trying to time a very short-running piece of code with a timer that doesn't have enough resolution. Either increase your input size, or use RDTSC.
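A rough sketch of the second suggestion (run the same input many times and keep the minimum), reusing the question's Exchange_sort; best_time, reference_input, and the run count of 100 are just illustrative choices:
#include <windows.h>
#include <vector>

// Smallest wall-clock time (in seconds) observed over 'runs' repetitions.
double best_time(const std::vector<int>& reference_input, int runs)
{
    LARGE_INTEGER freq, a, b;
    QueryPerformanceFrequency(&freq);
    double best = 1e300;
    for (int run = 0; run < runs; ++run)
    {
        std::vector<int> data = reference_input;   // identical input every run
        int conf = 0, copies = 0;
        double t_us = 0;
        QueryPerformanceCounter(&a);
        Exchange_sort(data.data(), (int)data.size(), conf, copies, t_us);
        QueryPerformanceCounter(&b);
        double t = double(b.QuadPart - a.QuadPart) / double(freq.QuadPart);
        if (t < best)
            best = t;
    }
    return best;
}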
The short answer to your question is that it is not possible to measure exactly the same time for every call of the same function.
The fact that you are getting different times is expected, because your operating system is not a perfect real-time system but a general-purpose OS with multiple processes running at the same time, competing to be scheduled by the kernel for their own CPU cycles.
Also consider that each time you execute your program or function, some of its instructions and data might be in RAM while others are already in the CPU's L1 or L2 cache, and this will probably change from one execution to another. So there are lots of variables to consider when measuring the elapsed time of function calls with a high level of precision.

C++ , Timer, Milliseconds

#include <iostream>
#include <conio.h>
#include <ctime>
using namespace std;
double diffclock(clock_t clock1,clock_t clock2)
{
double diffticks=clock1-clock2;
double diffms=(diffticks)/(CLOCKS_PER_SEC/1000);
return diffms;
}
int main()
{
clock_t start = clock();
for(int i=0;;i++)
{
if(i==10000)break;
}
clock_t end = clock();
cout << diffclock(start,end)<<endl;
getch();
return 0;
}
So my problem comes down to it returning 0; to be straight, I want to check how much time my program takes to run.
I found tons of answers on the internet, but mostly they come down to the same point: getting a 0 because the start and the end are the same.
This problem is in C++, remember :<
There are a few problems here. The first is that you obviously switched the start and stop times when passing them to the diffclock() function. The second problem is optimization. Any reasonably smart compiler with optimizations enabled would simply throw the entire loop away, as it does not have any side effects. But even if you fix the above problems, the program would most likely still print 0. If you consider that a CPU does billions of operations per second, and throw in sophisticated out-of-order execution, branch prediction, and tons of other technologies employed by modern CPUs, even the CPU itself may effectively optimize your loop away. And even if it doesn't, you'd need a lot more than 10K iterations to make it run longer. You'd probably need your program to run for a second or two before clock() reflects anything.
But the most important problem is clock() itself. That function is not suitable for any kind of performance measurement whatsoever. What it gives you is an approximation of the processor time used by the program. Aside from the vague nature of the approximation method that might be used by any given implementation (the standard doesn't require anything specific of it), the POSIX standard also requires CLOCKS_PER_SEC to be equal to 1000000 independent of the actual resolution. In other words, it doesn't matter how precise the clock is, and it doesn't matter at what frequency your CPU is running. To put it simply, it is a pretty useless number for this purpose and therefore a pretty useless function. The only reason it still exists is probably historical. So, please do not use it.
To achieve what you are looking for, people used to read the CPU time stamp counter, also known as "RDTSC" after the CPU instruction used to read it. These days, however, this is also mostly useless because:
Modern operating systems can easily migrate the program from one CPU to another. You can imagine that reading the time stamp on one CPU after running for a second on another doesn't make a lot of sense. Only in the latest Intel CPUs is the counter synchronized across CPU cores. All in all, it is still possible to do this, but a lot of extra care must be taken (e.g. one can set up the affinity for the process, etc.).
Measuring CPU instructions of the program oftentimes does not give an accurate picture of how much time it is actually using. This is because in real programs there could be some system calls where the work is performed by the OS kernel on behalf of the process. In that case, that time is not included.
It could also happen that the OS suspends execution of the process for a long time. Even though it took only a few instructions to execute, to the user it seemed like a second. So such a performance measurement may be useless.
So what to do?
When it comes to profiling, a tool like perf must be used. It can track the number of CPU clocks, cache misses, branches taken, branches mispredicted, the number of times the process was moved from one CPU to another, and so on. It can be used as a standalone tool, or it can be embedded into your application (with something like PAPI).
And if the question is about actual time spent, people use a wall clock. Preferably a high-precision one that is also not subject to NTP adjustments (i.e. monotonic). That shows exactly how much time elapsed, no matter what was going on. For that purpose clock_gettime() can be used; it is part of SUSv2 and POSIX.1-2001. Given that you use getch() to keep the terminal open, I'd assume you are using Windows. There, unfortunately, you don't have clock_gettime(), and the closest thing is the performance counter API:
BOOL QueryPerformanceFrequency(LARGE_INTEGER *lpFrequency);
BOOL QueryPerformanceCounter(LARGE_INTEGER *lpPerformanceCount);
For a portable solution, your best bet is std::chrono::high_resolution_clock (or std::chrono::steady_clock, which the example below uses). It was introduced in C++11 and is supported by most industrial-grade compilers (GCC, Clang, MSVC).
Below is an example of how to use it. Please note that since I know my CPU will do 10000 increments of an integer far faster than a millisecond, I have changed the output to microseconds. I've also declared the counter as volatile in the hope that the compiler won't optimize the loop away.
#include <ctime>
#include <chrono>
#include <iostream>
int main()
{
volatile int i = 0; // "volatile" is to ask compiler not to optimize the loop away.
auto start = std::chrono::steady_clock::now();
while (i < 10000) {
++i;
}
auto end = std::chrono::steady_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
std::cout << "It took me " << elapsed.count() << " microseconds." << std::endl;
}
When I compile and run it, it prints:
$ g++ -std=c++11 -Wall -o test ./test.cpp && ./test
It took me 23 microseconds.
Hope it helps. Good Luck!
At a glance, it seems like you are subtracting the larger value from the smaller value. You call:
diffclock( start, end );
But then diffclock is defined as:
double diffclock( clock_t clock1, clock_t clock2 ) {
double diffticks = clock1 - clock2;
double diffms = diffticks / ( CLOCKS_PER_SEC / 1000 );
return diffms;
}
Apart from that, it may have something to do with the way you are converting units; compare your use of 1000 with the conversion to milliseconds shown on this page:
http://en.cppreference.com/w/cpp/chrono/c/clock
The problem appears to be that the loop is just too short. I tried it on my system and it gave 0 ticks. I checked what diffticks was, and it was 0. Increasing the loop size to 100000000 gave a noticeable time lag, and I got -290 as output (a bug: diffticks should be clock2 - clock1, so we should get 290, not -290). I also tried changing "1000" to "1000.0" in the division, and that didn't help.
Compiling with optimization does remove the loop, so you either have to not use optimization or make the loop "do something", e.g. increment a counter other than the loop counter in the loop body. At least that's what GCC does.
Note: this is available since C++11.
You can use the std::chrono library.
std::chrono has two distinct kinds of objects: time points and durations. A time point represents a point in time, and a duration, as the term suggests, represents an interval or span of time.
This C++ library lets you subtract two time points to get the duration of time passed in the interval. So you can set a starting point and a stopping point, and with the provided functions you can also convert the result into appropriate units.
Example using high_resolution_clock (which is one of the three clocks this library provides):
#include <chrono>
using namespace std::chrono;
//before running function
auto start = high_resolution_clock::now();
//after calling function
auto stop = high_resolution_clock::now();
Subtract the stop and start time points and cast the result into the required units using the duration_cast() function. Predefined units are nanoseconds, microseconds, milliseconds, seconds, minutes, and hours.
auto duration = duration_cast<microseconds>(stop - start);
cout << duration.count() << endl;
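Put together, a minimal self-contained sketch; work() here is just a stand-in for whatever function you want to time:
#include <chrono>
#include <iostream>

void work()                                     // stand-in for the code being timed
{
    volatile long sum = 0;
    for (long i = 0; i < 1000000; ++i)
        sum += i;
}

int main()
{
    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    work();
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " microseconds" << std::endl;
}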
First of all, you should subtract end - start, not vice versa.
The documentation says clock() returns -1 if the value is not available; did you check for that?
What optimization level do you use when compiling your program? If optimization is enabled, the compiler can effectively eliminate your loop entirely.

QueryPerformanceCounter and overflows

I'm using QueryPerformanceCounter to do some timing in my application. However, after running it for a few days the application seems to stop functioning properly. If I simply restart the application it starts working again. This makes me believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
timer()
{
QueryPerformanceFrequency(&freq_);
QueryPerformanceCounter(&time_);
}
void tick(double interval)
{
LARGE_INTEGER t;
QueryPerformanceCounter(&t);
if (time_.QuadPart != 0)
{
int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
int done = 0;
do
{
QueryPerformanceCounter(&t);
int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
int ticks_left = ticks_to_wait - ticks_passed;
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
if (ticks_passed >= ticks_to_wait)
done = 1;
if (!done)
{
// if > 0.002s left, do Sleep(1), which will actually sleep some
// steady amount, probably 1-2 ms,
// and do so in a nice way (cpu meter drops; laptop battery spared).
// otherwise, do a few Sleep(0)'s, which just give up the timeslice,
// but don't really save cpu or battery, but do pass a tiny
// amount of time.
if (ticks_left > static_cast<int>((freq_.QuadPart*2)/1000))
Sleep(1);
else
for (int i = 0; i < 10; ++i)
Sleep(0); // causes thread to give up its timeslice
}
}
while (!done);
}
time_ = t;
}
private:
LARGE_INTEGER freq_;
LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically for weeks of continuous running, and if not, where the problem is.
I thought the overflow was handled by
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
But maybe that's not enough?
EDIT: Please note that I did not write the original code; Ryan M. Geiss did. The link to the original source is in the code.
QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing, if you're prepared to handle abnormal results. It is not exact - it's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the cpu if another context wants it - otherwise it'll return immediately.
In short, timing on Windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without jumping through hoops, but this isn't the case. In our game framework we're using several time sources and corrections from the server to ensure all connected clients have the same game time, and there are a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime and wrap it in something that adjusts for time jumps and wrap-arounds.
Also, you should convert your double interval to an int64 milliseconds and then use only integer math - this avoids problems due to floating point types' varying accuracy based on their contents.
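For illustration, a small sketch of the integer-millisecond approach with GetTickCount; elapsed_ms and interval_ms are just illustrative names. Because DWORD arithmetic is modulo 2^32, the subtraction stays correct across the 49.7-day wrap as long as each measured interval is shorter than that:
#include <windows.h>

// Elapsed milliseconds between two GetTickCount() readings.
DWORD elapsed_ms(DWORD then, DWORD now)
{
    return now - then;              // unsigned wrap-around makes this safe
}

// Usage sketch:
//   DWORD t0 = GetTickCount();
//   ...
//   if (elapsed_ms(t0, GetTickCount()) >= interval_ms) { /* tick */ }
// On Vista and later, GetTickCount64() sidesteps the wrap entirely.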
Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call
Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.
Using a nanosecond-scale timer to control something like Sleep(), which at best is precise to several milliseconds (and usually several dozen milliseconds), is somewhat questionable anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns fewer CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could, for example, create a semaphore and never touch it in normal operation. The semaphore exists only so you have something to wait on, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break up earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.
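A minimal sketch of that pattern, with wake, wait_one_tick, and interval_ms as illustrative names and error handling omitted:
#include <windows.h>

HANDLE wake = CreateSemaphore(NULL, 0, 1, NULL);   // never signaled in normal operation

void wait_one_tick(DWORD interval_ms)
{
    DWORD r = WaitForSingleObject(wake, interval_ms);
    if (r == WAIT_TIMEOUT)
    {
        // The full interval elapsed: do the periodic work.
    }
    else if (r == WAIT_OBJECT_0)
    {
        // Another thread called ReleaseSemaphore(wake, 1, NULL): wake up early.
    }
}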
The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could also wrap at any other time it felt like; that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time delta, so that's safe. It shouldn't be necessary on modern systems, but trusting Microsoft is rarely a good idea.
...
However, the bigger problem with time wrapping here is the fact that ticks_to_wait, ticks_passed, and ticks_left are all int, not LARGE_INTEGER (or long long) like they should be. This makes most of that code wrap if any significant time periods are involved - and "significant" in this context is platform dependent; it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPC's return value being zero. The return value is not the 64-bit time passed by pointer, it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish - it appears to be tuned to behave correctly only on a particular level of contention and a particular per-thread CPU performance. If you need resolution that's a horrible idea, and if you don't need resolution then that entire function should have just been a single call to Sleep.

clock() vs GetSystemTime()

I developed a class for calculations on multiple threads, and only one instance of this class is used per thread. I also want to measure the duration of the calculations by iterating over a container of these objects from another thread. The application is Win32. The thing is, I have read that QueryPerformanceCounter is useful when comparing measurements on a single thread. Because I cannot use it for my problem, I am thinking of clock() or GetSystemTime(). It is sad that both methods have a 'resolution' of milliseconds (since CLOCKS_PER_SEC is 1000 on Win32). Which method should I use, or, more generally, is there a better option for me?
As a rule I have to take the measurements outside the working thread.
Here is some code as an example.
unsigned long GetCounter()
{
SYSTEMTIME ww;
GetSystemTime(&ww);
return ww.wMilliseconds + 1000 * ww.wSeconds;
// or
return clock();
}
class WorkClass
{
bool is_working;
unsigned long counter;
HANDLE threadHandle;
public:
void DoWork()
{
threadHandle = GetCurrentThread();
is_working = true;
counter = GetCounter();
// Do some work
is_working = false;
}
};
void CheckDurations() // will work on another thread;
{
for(size_t i =0;i < vector_of_workClass.size(); ++i)
{
WorkClass & wc = vector_of_workClass[i];
if(wc.is_working)
{
unsigned long dur = GetCounter() - wc.counter;
ReportDuration(wc,dur);
if( dur > someLimitValue)
TerminateThread(wc.threadHandle, 0);
}
}
}
QueryPerformanceCounter is fine for multithreaded applications. The processor instruction that may be used (rdtsc) can potentially provide invalid results when called on different processors.
I recommend reading "Game Timing and Multicore Processors".
For your specific application, the problem it appears you are trying to solve is using a timeout on some potentially long-running threads. The proper solution to this would be to use the WaitForMultipleObjects function with a timeout value. If the time expires, then you can terminate any threads that are still running - ideally by setting a flag that each thread checks, but TerminateThread may be suitable.
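A rough sketch of that approach; wait_with_timeout is a hypothetical name, and it assumes the worker handles come from CreateThread (GetCurrentThread() only returns a pseudo-handle that other threads cannot use) and are kept in a vector of at most MAXIMUM_WAIT_OBJECTS entries:
#include <windows.h>
#include <vector>

// Wait up to timeout_ms for all workers; deal with any that are still running.
void wait_with_timeout(const std::vector<HANDLE>& workers, DWORD timeout_ms)
{
    DWORD r = WaitForMultipleObjects((DWORD)workers.size(), workers.data(),
                                     TRUE /* wait for all */, timeout_ms);
    if (r == WAIT_TIMEOUT)
    {
        for (HANDLE h : workers)
        {
            if (WaitForSingleObject(h, 0) == WAIT_TIMEOUT)  // thread not finished yet
            {
                // Preferably signal a "please stop" flag the thread checks;
                // TerminateThread(h, 1) is a last resort.
            }
        }
    }
}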
both methods have a precision of milliseconds
They don't. They have a resolution of a millisecond; the precision is far worse. Most machines increment the value only at intervals of 15.625 msec. That's a heck of a lot of CPU cycles, usually not good enough to get any reliable indicator of code efficiency.
QueryPerformanceCounter does much better; no idea why you couldn't use it. A profiler is the standard tool to measure code efficiency, and it beats taking dependencies you don't want.
QueryPerformanceCounter should give you the best precision, but there are issues when the function is run on different processors (you can get a different result on each processor). So when running in a thread you will see shifts when the thread switches processors. To solve this you can set processor affinity for the thread that measures time.
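A minimal sketch of that idea; timed_section_ms is a hypothetical name, and pinning to the first logical CPU (mask 1) is an arbitrary choice:
#include <windows.h>

// Time a section with QueryPerformanceCounter while pinned to one core.
double timed_section_ms()
{
    DWORD_PTR old_mask = SetThreadAffinityMask(GetCurrentThread(), 1);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    // ... the work being measured ...
    QueryPerformanceCounter(&t1);

    SetThreadAffinityMask(GetCurrentThread(), old_mask);   // restore previous affinity
    return 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}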
GetSystemTime gets an absolute time, clock is a relative time but both measure elapsed time, not CPU time related to the actual thread/process.
Of course clock() is more portable. Having said that I use clock_gettime on Linux because I can get both elapsed and thread CPU time with that call.
Boost has some time functions you could use that will run on multiple platforms if you want platform-independent code.
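For the Linux side mentioned above, a small sketch of getting both kinds of time with clock_gettime (the function names here are just illustrative; link with -lrt on older glibc):
#include <time.h>

double wall_seconds()           // elapsed (monotonic) time
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

double thread_cpu_seconds()     // CPU time of the calling thread
{
    struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}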

optimize time(NULL) call in c++

I have a system that spends 66% of its time in a time(NULL) call.
Is there a way to cache or optimize this call?
Context: I'm playing with Protothread for C++, trying to simulate threads with state machines, so I can't use native threads.
Here's the header:
#ifndef __TIMER_H__
#define __TIMER_H__
#include <time.h>
#include <iostream>
class Timer
{
private:
time_t initial;
public:
Timer();
unsigned long passed();
};
#endif
and the source file:
#include "Timer.h"
using namespace std;
Timer::Timer()
{
initial = time(NULL);
}
unsigned long Timer::passed()
{
time_t current = time(NULL);
return (current - initial);
}
UPDATE:
Final solution!
The CPU cycles are going somewhere anyway, and if I spend them on being correct, that is not so bad after all.
#define start_timer() timer_start=time(NULL)
#define timeout(x) ((time(NULL)-timer_start)>=x)
I presume you are calling it within some loop which is otherwise stonkingly efficient.
What you could do is keep a count of how many iterations your loop goes through before the return value of time changes.
Then don't call it again until you've gone through that many iterations again.
You can dynamically adjust this count upwards or downwards if you find you're going adrift, but you should be able to engineer it so that on average, it calls time() once per second.
Here's a rough idea of how you might do it (there are many variations on this theme):
int iterations_per_sec = 10; // wild guess
int iterations = 0;
time_t lasttime = time(NULL);
while (looping)
{
    // do the real work

    // check our timing
    if (++iterations > iterations_per_sec)
    {
        time_t t = time(NULL);
        if (t == lasttime)
        {
            iterations_per_sec++;   // checked too early; stretch the interval
        }
        else
        {
            iterations_per_sec = iterations / (t - lasttime);
            iterations = 0;
            lasttime = t;
            // do whatever else you want to do on a per-second basis
        }
    }
}
That sounds like a lot, given that time() only has a resolution of 1 second. It sounds like you call it far too often. One possible improvement would be to call it only every 500 ms, so it will still catch every second.
So instead of calling it 100 times a second, start a timer that fires every 500 ms, takes the current time, and stores it in an integer. Then read that integer 100 times a second instead.
As pointed out, you cannot cache it, as the whole point of time() is to give you the current time, which obviously changes all the time.
The real question however probably is: Why is the program calling time() so frequently? I can't think of any good reason to do so.
Is it polling time()? In that case sleep() might be more appropriate.
Call it less often - unless you really need the current time hundreds of times a second, you shouldn't be calling it that often.
EDIT:
After trying it, I'm even more curious. I realize you might be on a small embedded system, but on my system I had no problem running 10,000,000 calls to time() in a second. You're likely doing something seriously wrong, given that time() is only going to change once a second. What exactly are you trying to achieve?
If you're on Unix, you may consider using gettimeofday (http://www.opengroup.org/onlinepubs/000095399/functions/gettimeofday.html) - it's faster and has better precision.
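For example, a minimal sketch (microsecond resolution, though unlike a monotonic clock it can jump if the system time is adjusted; now_seconds is just an illustrative name):
#include <sys/time.h>

double now_seconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}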
Caching will not help unless you don't actually need the current time. Can you post some code?
It really depends, but saving the result won't help if you always want the current time. time( NULL ) likely results in a system call, which will take time since you have to switch to/from kernel mode.
What you can do is read the TSC at the same time that you get the current time, then read the TSC again when you want the current time, and add (elapsed cycles / CPU speed) to your saved time.
There are some answers about rdtsc on here that should help you.
Edit: see my answer in Timer to find elapsed time in a function call in C for more information about rdtsc.
Also note that I don't particularly recommend this unless you absolutely have to. It is highly likely that calling rdtsc, subtracting the previous rdtsc value, and converting that to a fractional number of seconds by dividing by your CPU speed will be slower than just calling time() again.
Typically what you can do is save the result of time off into a local variable, and then use that as your current time until you perform some blocking call, or some long running CPU intensive section of code.
What are you doing that you need to call time() this often, and can you post some code?
You could create a thread which called time() a few times a second and then slept, updating a shared variable.
A quick skim of Protothread implied that it didn't use OS threads, so you might get away with no memory barriers. Otherwise something like an efficient read/write lock should keep the cost negligible.
You could use a separate thread which would run an endless loop that would sleep() for 1 second (or less if you need finer granularity) and then update the timestamp value.
Other threads would just check this timestamp value without any performance penalty.
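A minimal sketch of that idea using C++11 facilities; cached_now and ticker are hypothetical names, and it assumes you can afford one real OS thread for the ticker even though the workload itself uses Protothread:
#include <atomic>
#include <chrono>
#include <thread>
#include <time.h>

std::atomic<time_t> cached_now(time(NULL));

void ticker()                                     // runs on its own thread
{
    for (;;)
    {
        cached_now.store(time(NULL), std::memory_order_relaxed);
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }
}

// Start once:   std::thread(ticker).detach();
// Read cheaply: time_t now = cached_now.load(std::memory_order_relaxed);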