I have been looking at the performance of our C++ server application running on embedded Linux (ARM). The pseudo code for the main processing loop of the server is this -
for i = 1 to 1000
Process item i
Sleep for 20 ms
The processing for one item takes about 2ms. The "Sleep" here is really a call to the Poco library to do a "tryWait" on an event. If the event is fired (which it never is in my tests) or the time expires, it returns. I don't know what system call this equates to. Although we ask for a 2ms block, it turns out to be roughly 20ms. I can live with that - that's not the problem. The sleep is just an artificial delay so that other threads in the process are not starved.
The loop takes about 24 seconds to go through 1000 items.
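For concreteness, the loop is roughly this shape in code (an illustrative sketch only - processItem and the event name are stand-ins, not our real identifiers):

#include "Poco/Event.h"

Poco::Event wakeEvent;        // never signalled in these tests

void processItem(int i);      // the real per-item work, roughly 2 ms

void processAllItems()
{
    for (int i = 1; i <= 1000; i++)
    {
        processItem(i);
        wakeEvent.tryWait(2); // ask for a 2 ms wait; in practice it blocks for roughly 20 ms
    }
}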
The problem is, we changed the way the sleep is used so that we had a bit more control. I mean - 20ms delay for 2ms processing doesn't allow us to do much processing. With this new parameter set to a certain value it does something like this -
for i = 1 to 1000
Process item i
if i % 50 == 0 then sleep for 1000ms
That's the rough code; in reality the number of sleeps is slightly different, and it happens to work out at a 24s cycle to get through all the items - just as before.
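In the same illustrative style, the new scheme is roughly:

void processAllItemsBatched()
{
    for (int i = 1; i <= 1000; i++)
    {
        processItem(i);              // still roughly 2 ms per item
        if (i % 50 == 0)
            wakeEvent.tryWait(1000); // one long 1000 ms pause per batch of 50 items
    }
}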
So we are doing exactly the same amount of processing in the same amount of time.
Problem 1 - the CPU usage for the original code is reported at around 1% (it varies a little but that's about average) and the CPU usage reported for the new code is about 5%. I think they should be the same.
Well, perhaps this CPU reporting isn't accurate, so I thought I'd sort a large text file at the same time and see how much it's slowed down by our server. The sort is a CPU-bound process (98% CPU usage according to top). The results are very odd: with the old code, the time taken to sort the file goes up by 21% when our server is running.
Problem 2 - If the server is only using 1% of the CPU then wouldn't the time taken to do the sort be pretty much the same?
Also, the time taken to go through all the items doesn't change - it's still 24 seconds with or without the sort running.
Then I tried the new code: it only slows the sort down by about 12%, but it now takes about 40% longer to get through all the items it has to process.
Problem 3 - Why do the two ways of introducing an artificial delay cause such different results? It seems that the server which sleeps more frequently, but for a minimum time, is getting more priority.
I have a half-baked theory on the last one - whatever system call is used to do the "sleep" switches back to the server process when the time has elapsed. This gives the process another bite at the time slice on a regular basis.
Any help appreciated. I suspect I'm just not understanding it correctly and that things are more complicated than I thought. I can provide more details if required.
Thanks.
Update: replaced tryWait(2) with usleep(2000) - no change. In fact, sched_yield() does the same.
Well I can at least answer problem 1 and problem 2 (as they are the same issue).
After trying out various options in the actual server code, we came to the conclusion that the CPU reporting from the OS is incorrect. It's quite a surprising result, so to make sure, I wrote a stand-alone program that doesn't use Poco or any of our code - just plain Linux system calls and standard C++ features. It implements the pseudo code above. The processing is replaced with a tight loop that just checks the elapsed time to see if 2ms is up. The sleeps are proper sleeps.
The small test program shows exactly the same problem: doing the same amount of processing but splitting up the way the sleep function is called produces very different results for CPU usage. In the case of the test program, the reported CPU usage was 0.0078 seconds using 1000 20ms sleeps, but 1.96875 seconds when the less frequent 1000ms sleep was used. The amount of processing done is the same.
Running the test on a Linux PC did not show the problem. Both ways of sleeping produced exactly the same CPU usage.
So it's clearly a problem with our embedded system and the way it measures CPU time when a process is yielding so often (you get the same problem with sched_yield instead of a sleep).
Update: Here's the code. RunLoop is where the main bit is done -
#include <iostream>
#include <sys/time.h>   // gettimeofday
#include <time.h>       // clock_gettime
#include <unistd.h>     // usleep

int sleepCount;
double getCPUTime( )
{
clockid_t id = CLOCK_PROCESS_CPUTIME_ID;
struct timespec ts;
if ( id != (clockid_t)-1 && clock_gettime( id, &ts ) != -1 )
return (double)ts.tv_sec +
(double)ts.tv_nsec / 1000000000.0;
return -1;
}
double GetElapsedMilliseconds(const timeval& startTime)
{
timeval endTime;
gettimeofday(&endTime, NULL);
double elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // us to ms
return elapsedTime;
}
void SleepMilliseconds(int milliseconds)
{
timeval startTime;
gettimeofday(&startTime, NULL);
usleep(milliseconds * 1000);
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > milliseconds + 0.3)
std::cout << "Sleep took longer than it should " << elapsedMilliseconds;
sleepCount++;
}
void DoSomeProcessingForAnItem()
{
timeval startTime;
gettimeofday(&startTime, NULL);
double processingTimeMilliseconds = 2.0;
double elapsedMilliseconds;
do
{
elapsedMilliseconds = GetElapsedMilliseconds(startTime);
} while (elapsedMilliseconds <= processingTimeMilliseconds);
if (elapsedMilliseconds > processingTimeMilliseconds + 0.1)
std::cout << "Processing took longer than it should " << elapsedMilliseconds;
}
void RunLoop(bool longSleep)
{
int numberOfItems = 1000;
timeval startTime;
gettimeofday(&startTime, NULL);
timeval startMainLoopTime;
gettimeofday(&startMainLoopTime, NULL);
for (int i = 0; i < numberOfItems; i++)
{
DoSomeProcessingForAnItem();
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > 100)
{
std::cout << "Item count = " << i << "\n";
if (longSleep)
{
SleepMilliseconds(1000);
}
gettimeofday(&startTime, NULL);
}
if (longSleep == false)
{
// Does 1000 * 20 ms sleeps.
SleepMilliseconds(20);
}
}
double elapsedMilliseconds = GetElapsedMilliseconds(startMainLoopTime);
std::cout << "Main loop took " << elapsedMilliseconds / 1000 <<" seconds\n";
}
void DoTest(bool longSleep)
{
timeval startTime;
gettimeofday(&startTime, NULL);
double startCPUtime = getCPUTime();
sleepCount = 0;
int runLoopCount = 1;
for (int i = 0; i < runLoopCount; i++)
{
RunLoop(longSleep);
std::cout << "**** Done one loop of processing ****\n";
}
double endCPUtime = getCPUTime();
std::cout << "Elapsed time is " <<GetElapsedMilliseconds(startTime) / 1000 << " seconds\n";
std::cout << "CPU time used is " << endCPUtime - startCPUtime << " seconds\n";
std::cout << "Sleep count " << sleepCount << "\n";
}
void testLong()
{
std::cout << "Running testLong\n";
DoTest(true);
}
void testShort()
{
std::cout << "Running testShort\n";
DoTest(false);
}
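To run both variants, a minimal driver (simply calling the two test functions above) looks like this:

int main()
{
    testShort();   // 1000 sleeps of 20 ms each
    testLong();    // one 1000 ms sleep roughly every 50 items
    return 0;
}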
Related
I am trying to use a while loop to create a timer that consistently measures out 3000μs (3 ms) and while it works most of the time, other times the timer can be late by as much as 500μs. Why does this happen and is there a more precise way to make a timer like this?
#include <iostream>
#include <string>
#include <chrono>
using namespace std;

long long getTime() {
    chrono::microseconds us = chrono::duration_cast<chrono::microseconds>(
        chrono::system_clock::now().time_since_epoch() // time since the epoch in microseconds
    );
    return us.count(); // return as a 64-bit integer (a plain int would overflow)
}
int main()
{
long long target = 3000, difference = 0;
while (true) {
long long start = getTime(), time = start;
while ((time-start) < target) {
time = getTime();
}
difference = time - start;
if (difference - target > 1) { //If the timer wasn't accurate to within 1μs
cout << "Timer missed the mark by " + to_string(difference - target) + " microseconds" << endl; //Log the delay
}
}
return 0;
}
I would expect this code to log delays that are consistently within 5 or so μs, but the console output looks like this.
Edit to clarify: I'm running on Windows 10 Enterprise Build 16299, but the behavior persists on a Debian virtual machine.
You need to also take into account other running processes. The operating system is likely preempting your process to give CPU time to those other processes/threads, and will non-deterministically return control to your process/thread running this timer.
Granted, this is not 100% true when we consider real-time operating systems or flat-schedulers. But this is likely the case in your code if you're running on a general purpose machine.
Since you are running on Windows, the OS is responsible for keeping time (typically synchronised via NTP), as C++ has no built-in facility for this. Check out the Windows SetTimer() function: http://msdn.microsoft.com/en-us/library/ms644906(v=vs.85).aspx.
If you want the highest-resolution clock C++ offers, check out the chrono library:
#include <iostream>
#include <chrono>

int main()
{
    typedef std::chrono::high_resolution_clock Clock;
    auto t1 = Clock::now();
    auto t2 = Clock::now();
    // durations don't stream directly, so convert explicitly
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " ns\n";
}
I'm creating a performance framework tool for measuring individual message processing times on CentOS 7. I reserved one CPU for this task with the isolcpus kernel option and I run the tool using taskset.
OK, now the problem. I'm trying to measure the maximum processing time among several messages. The processing time is <= 1000ns, but when I run many iterations I get much higher results (> 10000ns).
Here is some simple code which does nothing interesting but shows the problem. Depending on the number of iterations I can get results like:
max: 84 min: 23 -> for 1000 iterations
max: 68540 min: 11 -> for 100000000 iterations
I'm trying to understand where this difference comes from. I tried running with real-time scheduling at the highest priority. Is there some way to prevent this?
#include <iostream>
#include <limits>
#include <algorithm>   // std::max, std::min
#include <cstdint>     // int64_t
#include <time.h>

const unsigned long long SEC = 1000ULL*1000ULL*1000ULL;
inline int64_t time_difference( const timespec &start,
const timespec &stop ) {
return ( (stop.tv_sec * SEC - start.tv_sec * SEC) +
(stop.tv_nsec - start.tv_nsec));
}
int main()
{
timespec start, stop;
int64_t max = 0, min = std::numeric_limits<int64_t>::max();
for(int i = 0; i < 100000000; ++i){
clock_gettime(CLOCK_REALTIME, &start);
clock_gettime(CLOCK_REALTIME, &stop);
int64_t time = time_difference(start, stop);
max = std::max(max, time);
min = std::min(min, time);
}
std::cout << "max: " << max << " min: " << min << std::endl;
}
You can't really reduce jitter to zero even with isolcpus, since you still have at least the following:
1) Interrupts delivered to your CPU (you may be able to reduce these by messing with IRQ affinity - but probably not to zero).
2) Clock timer interrupts are still scheduled for your process and may do a variable amount of work on the kernel side.
3) The CPU itself may pause briefly for P-state or C-state transitions, or other reasons (e.g., to let voltage levels settle after turning on AVX circuitry, etc).
Let us check the documentation...
Isolation will be effected for userspace processes - kernel threads may still get scheduled on the isolcpus isolated CPUs.
So it seems that there is no guarantee of perfect isolation, at least not from the kernel.
I am trying to create a function that lets me enter the desired frames per second and the maximum frame count, and then have the function "cout" to the console on fixed time steps. I am using Sleep() to avoid busy waiting as well. I seem to make the program sleep longer than it needs to - I think it keeps stalling on the sleep command. Can you help me with this? I am having some trouble understanding timing, especially on Windows.
Ultimately I will probably use this timing method to time and animate a simple game, maybe like Pong, or even a simple program with objects that can accelerate. I think I already understand GDI and WASAPI well enough to play sound and show color on the screen, so now I need to understand timing. I have looked for a long time before asking this question and I am sure that I am missing something, but I can't quite put my finger on it :(
Here is the code:
#include <windows.h>
#include <iostream>
// in this program i am trying to make a simple function that prints frame: and the number frame in between fixed time intervals
// i am trying to make it so that it doesn't do busy waiting
using namespace std;
void frame(LARGE_INTEGER& T, LARGE_INTEGER& T3, LARGE_INTEGER& DELT,LARGE_INTEGER& DESI, double& framepersec,unsigned long long& count,unsigned long long& maxcount,bool& on, LARGE_INTEGER& mili)
{
QueryPerformanceCounter(&T3); // second measurement
DELT.QuadPart = T3.QuadPart - T.QuadPart; // ticks elapsed between the two measurements
if(DELT.QuadPart >= DESI.QuadPart) {count++; cout << "frame: " << count << " !" << endl; T.QuadPart = T3.QuadPart; } // adding to the count by just one frame (this may cause problems if more than one passes)
if(count > maxcount) {on = false;} // turning off the loop
else {DESI.QuadPart = T.QuadPart + DESI.QuadPart;//(long long)framepersec; // setting the stop tick
unsigned long long sleep = (( DESI.QuadPart - DELT.QuadPart) / mili.QuadPart);
cout << sleep << endl;
Sleep(sleep);} // sleeping to avoid busy waiting
}
int main()
{
LARGE_INTEGER T1, T2, Freq, Delta, desired, mil;
bool loopon = true; // keeps the loop flowing until max frames has been reached
QueryPerformanceFrequency(&Freq); // getting num of updates per second
mil.QuadPart = Freq.QuadPart / 1000; // getting the number clock updates that occur in a millisecond
double framespersec; // the target number of frames per second entered by the user
unsigned long long framecount,maxcount; //to stop the program after a certain amount of frames
framecount = 0;
cout << "Hello world! enter the amount of frames per second : " << endl;
cin >> framespersec;
cout << "you entered: " << framespersec << " ! how many max frames?" << endl;
cin >> maxcount;
cout << "you entered: " << maxcount << " ! now doing the frames !!!" << endl;
desired.QuadPart = (LONGLONG)(Freq.QuadPart / framespersec); // ticks per target frame
QueryPerformanceCounter(&T1); // take the first measurement so the first delta is valid
while(loopon == true) {
frame(T1, T2, Delta, desired, framespersec, framecount, maxcount,loopon, mil);
}
cout << "all frames are done!" << endl;
return 0;
}
The time that you sleep is limited by the frequency of the system clock. The frequency defaults to 64 Hz, so you'll end up seeing sleeps in increments of 16ms. Any sleep that's less than 16ms will be at least 16ms long - it could be longer depending on CPU load. Likewise, a sleep of 20ms will likely be rounded up to 32ms.
You can change this period by calling timeBeginPeriod(...) and timeEndPeriod(...), which can increase sleep accuracy to 1ms. If you have a look at multimedia apps like VLC Player, you'll see that they use these functions to get reliable frame timing. Note that this changes the system wide scheduling rate, so it will affect battery life on laptops.
More info:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd757624%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298%28v=vs.85%29.aspx
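As a rough sketch of how those calls fit together (the 1 ms period and the surrounding function are only an example):

#include <windows.h>
#include <mmsystem.h>               // timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")

void timingSensitiveSection()
{
    timeBeginPeriod(1);  // request 1 ms scheduler granularity (system wide)
    Sleep(5);            // now rounds to roughly 5-6 ms instead of ~16 ms
    timeEndPeriod(1);    // restore the default when done
}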
Waitable timers are more accurate than Sleep, and also integrate with a GUI message loop better (replace GetMessage with MsgWaitForMultipleObjects). I've used them successfully for graphics timing before.
They won't get you high precision for e.g. controlling serial or network output at sub-millisecond timing, but UI updates are limited by VSYNC anyway.
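A bare-bones sketch of a periodic waitable timer (the 16 ms period is just an example; error handling omitted):

#include <windows.h>
#include <iostream>

int main()
{
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);  // auto-reset, unnamed

    LARGE_INTEGER due;
    due.QuadPart = -160000LL;                               // first expiry in 16 ms (100-ns units, negative = relative)
    SetWaitableTimer(timer, &due, 16, NULL, NULL, FALSE);   // then fire every 16 ms

    for (int frame = 0; frame < 10; ++frame)
    {
        WaitForSingleObject(timer, INFINITE);               // or MsgWaitForMultipleObjects in a GUI loop
        std::cout << "tick " << frame << "\n";
    }
    CloseHandle(timer);
    return 0;
}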
Is there any way in C++ to calculate how long does it take to run a given program or routine in CPU time?
I work with Visual Studio 2008 running on Windows 7.
If you want to know the total amount of CPU time used by a process, neither clock nor rdtsc (either directly or via a compiler intrinsic) is really the best choice, at least IMO. If you need the code to be portable, about the best you can do is use clock, test with the system as quiescent as possible, and hope for the best (but if you do, be aware that the resolution of clock is CLOCKS_PER_SEC, which may or may not be 1000, and even if it is, your actual timing resolution often won't be that good -- it may give you times in milliseconds, but at least normally advance tens of milliseconds at a time).
Since, however, you don't seem to mind the code being specific to Windows, you can do quite a bit better. At least if my understanding of what you're looking for is correctly, what you really want is probably GetProcessTimes, which will (separately) tell you both kernel-mode and user-mode CPU usage of the process (as well as the start time and exit time, from which you can compute wall time used, if you care). There's also QueryProcessCycleTime, which will tell you the total number of CPU clock cycles used by the process (total of both user and kernel mode in all threads). Personally, I have a hard time imagining much use for the latter though -- counting individual clock cycles can be useful for small sections of code subject to intensive optimization, but I'm less certain about how you'd apply it to a complete process. GetProcessTimes uses FILETIME structures, which support resolutions of 100 nanoseconds, but in reality most times you'll see will be multiples of the scheduler's time slice (which varies with the version of windows, but is on the order of milliseconds to tens of milliseconds).
In any case, if you truly want time from beginning to end, GetProcessTimes will let you do that -- if you spawn the program (e.g., with CreateProcess), you'll get a handle to the process which will be signaled when the child process exits. You can then call GetProcessTimes on that handle, and retrieve the times even though the child has already exited -- the handle will remain valid as long as at least one handle to the process remains open.
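A hedged sketch of that flow (the child executable name is made up; error handling is mostly omitted):

#include <windows.h>
#include <iostream>

int main()
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    char cmd[] = "child.exe";                       // hypothetical program to measure

    if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
        return 1;
    WaitForSingleObject(pi.hProcess, INFINITE);     // wait for the child to exit

    FILETIME creationTime, exitTime, kernelTime, userTime;
    GetProcessTimes(pi.hProcess, &creationTime, &exitTime, &kernelTime, &userTime);

    ULARGE_INTEGER k, u;                            // FILETIME counts 100-nanosecond intervals
    k.LowPart = kernelTime.dwLowDateTime; k.HighPart = kernelTime.dwHighDateTime;
    u.LowPart = userTime.dwLowDateTime;   u.HighPart = userTime.dwHighDateTime;
    std::cout << "kernel: " << k.QuadPart / 1e7 << " s, user: " << u.QuadPart / 1e7 << " s\n";

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}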
Here's one way. It measures routine execution time in milliseconds.
clock_t begin=clock(); is called before the routine is executed and clock_t end=clock(); is called right after the routine exits.
The two timestamps are then subtracted from each other and the result is a millisecond value.
#include <stdio.h>
#include <iostream>
#include <time.h>
using namespace std;
double get_CPU_time_usage(clock_t clock1,clock_t clock2)
{
double diffticks=clock1-clock2;
double diffms=(diffticks*1000)/CLOCKS_PER_SEC;
return diffms;
}
void test_CPU_usage()
{
cout << "Standby.. measuring exeution time: ";
for (int i=0; i<10000;i++)
{
cout << "\b\\" << std::flush;
cout << "\b|" << std::flush;
cout << "\b/" << std::flush;
cout << "\b-" << std::flush;
}
cout << " \n\n";
}
int main (void)
{
clock_t begin=clock();
test_CPU_usage();
clock_t end=clock();
cout << "Time elapsed: " << double(get_CPU_time_usage(end,begin)) << " ms ("<<double(get_CPU_time_usage(end,begin))/1000<<" sec) \n\n";
return 0;
}
The __rdtscp intrinsic will give you the time in CPU cycles with some caveats.
Here's the MSDN article
It really depends on what you want to measure. For better results, take the average of a few million (if not billion) iterations.
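A quick sketch with the MSVC intrinsic (bearing in mind the caveats above - cycle counts vary with frequency scaling, so average many runs):

#include <intrin.h>
#include <iostream>

int main()
{
    unsigned int aux;                      // receives the processor ID; usually ignored
    unsigned __int64 t1 = __rdtscp(&aux);
    // ... the code being measured goes here ...
    unsigned __int64 t2 = __rdtscp(&aux);
    std::cout << (t2 - t1) << " cycles\n";
    return 0;
}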
The clock() function [as provided by Visual C++ 2008] doesn't return processor time used by the program, while it should (according to the C standard and/or C++ standard). That said, to measure CPU time on Windows, I have this helper class (which is inevitably non-portable):
#include <windows.h>
#include <iostream>
#include <tuple>

class ProcessorTimer
{
public:
ProcessorTimer() { start(); }
void start() { ::GetProcessTimes(::GetCurrentProcess(), &ft_[3], &ft_[2], &ft_[1], &ft_[0]); } // ft_[3]=creation, ft_[2]=exit, ft_[1]=kernel, ft_[0]=user
std::tuple<double, double> stop()
{
::GetProcessTimes(::GetCurrentProcess(), &ft_[5], &ft_[4], &ft_[3], &ft_[2]);
ULARGE_INTEGER u[4];
for (size_t i = 0; i < 4; ++i)
{
u[i].LowPart = ft_[i].dwLowDateTime;
u[i].HighPart = ft_[i].dwHighDateTime;
}
double user = (u[2].QuadPart - u[0].QuadPart) / 10000000.0;
double kernel = (u[3].QuadPart - u[1].QuadPart) / 10000000.0;
return std::make_tuple(user, kernel);
}
private:
FILETIME ft_[6];
};
class ScopedProcessorTimer
{
public:
ScopedProcessorTimer(std::ostream& os = std::cerr) : timer_(ProcessorTimer()), os_(os) { }
~ScopedProcessorTimer()
{
std::tuple<double, double> t = timer_.stop();
os_ << "user " << std::get<0>(t) << "\n";
os_ << "kernel " << std::get<1>(t) << "\n";
}
private:
ProcessorTimer timer_;
std::ostream& os_;
};
For example, one can measure how long it takes a block to execute, by defining a ScopedProcessorTimer at the beginning of that {} block.
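For example, with a hypothetical doWork() standing in for the code being measured:

{
    ScopedProcessorTimer t;   // writes user/kernel CPU seconds to std::cerr when the block ends
    doWork();                 // hypothetical workload
}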
This code measures the CPU usage of a process, given its PID:
#include <windows.h>
#include <stdio.h>

double GetProcessCpuUsage(DWORD pid)
{
    ULONGLONG LastCycleTime = 0;
    LARGE_INTEGER LastPCounter;
    LastPCounter.QuadPart = 0; // LARGE_INTEGER init

    // get the number of CPU cores
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    int numProcessors = sysInfo.dwNumberOfProcessors;

    HANDLE hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pid);
    if (hProcess == NULL)
        return 0.0;

    double Usage = 0.0;
    int count = 0;
    while (true)
    {
        ULONG64 CycleTime;
        LARGE_INTEGER qpcLastInt;
        if (!QueryProcessCycleTime(hProcess, &CycleTime))
            break;
        ULONG64 cycle = CycleTime - LastCycleTime;
        if (!QueryPerformanceCounter(&qpcLastInt))
            break;
        Usage = cycle / ((double)(qpcLastInt.QuadPart - LastPCounter.QuadPart));
        // Scaling
        Usage *= 1.0 / numProcessors;
        Usage *= 0.1;
        LastPCounter = qpcLastInt;
        LastCycleTime = CycleTime;
        if (count > 3)
        {
            printf("%.1f", Usage);
            break;
        }
        Sleep(1); // QueryPerformanceCounter resolution is about 1 microsecond
        count++;
    }
    CloseHandle(hProcess);
    return Usage;
}
I have an experimental library whose performance I'm trying to measure. To do this, I've written the following:
struct timeval begin;
gettimeofday(&begin, NULL);
{
// Experiment!
}
struct timeval end;
gettimeofday(&end, NULL);
// Print the time it took!
std::cout << "Time: " << 100000 * (end.tv_sec - begin.tv_sec) + (end.tv_usec - begin.tv_usec) << std::endl;
Occasionally, my results include negative timings, some of which are nonsensical. For instance:
Time: 226762
Time: 220222
Time: 210883
Time: -688976
What's going on?
You've got a typo. Corrected last line (note the number of 0s):
std::cout << "Time: " << 1000000 * (end.tv_sec - begin.tv_sec) + (end.tv_usec - begin.tv_usec) << std::endl;
BTW, timersub is a built-in macro for getting the difference between two timevals.
The POSIX realtime libraries are better suited to measuring high-accuracy intervals. You really don't want to know the current time; you just want to know how long it has been between two points. That is what the monotonic clock is for.
struct timespec begin;
clock_gettime( CLOCK_MONOTONIC, &begin );
{
// Experiment!
}
struct timespec end;
clock_gettime(CLOCK_MONOTONIC, &end );
// Print the time it took!
std::cout << "Time: " << double(end.tv_sec - begin.tv_sec) + (end.tv_nsec - begin.tv_nsec)/1000000000.0 << std::endl;
When you link you need to add -lrt.
Using the monotonic clock has several advantages. It often uses the hardware timers (Hz crystal or whatever), so it is often a faster call than gettimeofday(). Also monotonic timers are guaranteed to never go backwards even if ntpd or a user is goofing with the system time.
You took care of the negative value but it still isn't correct. The difference between the microsecond fields is erroneous; say we have begin and end times of 1.100s and 2.051s. By the accepted answer this would be an elapsed time of 1.049s, which is incorrect.
The code below handles the case where the times differ only in microseconds and not seconds, and the case where the microseconds value wraps around.
if(end.tv_sec==begin.tv_sec)
printf("Total Time =%ldus\n",(end.tv_usec-begin.tv_usec));
else
printf("Total Time =%ldus\n",(end.tv_sec-begin.tv_sec-1)*1000000+(1000000-begin.tv_usec)+end.tv_usec);
std::cout << "Time: " << 100000 * (end.tv_sec - begin.tv_sec) + (end.tv_usec - begin.tv_usec) << std::endl;
As noted, there are 1000000 usec in a sec, not 100000.
More generally, you may need to be aware of the instability of timing on computers. Processes such as ntpd can change clock times, leading to incorrect delta times. You might be interested in POSIX facilities such as timer_create.
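For reference, a minimal periodic timer built on timer_create and the monotonic clock might look like this (the 10 ms period is arbitrary; error handling omitted; link with -lrt on older glibc):

#include <signal.h>
#include <time.h>
#include <unistd.h>

static void onTick(union sigval)
{
    // runs on a helper thread at every expiration
}

int main()
{
    struct sigevent sev = {};
    sev.sigev_notify = SIGEV_THREAD;       // deliver expirations as a function call
    sev.sigev_notify_function = onTick;

    timer_t timerId;
    timer_create(CLOCK_MONOTONIC, &sev, &timerId);

    struct itimerspec its = {};
    its.it_value.tv_nsec = 10 * 1000 * 1000;    // first expiry after 10 ms
    its.it_interval.tv_nsec = 10 * 1000 * 1000; // then every 10 ms
    timer_settime(timerId, 0, &its, NULL);

    sleep(1);                                   // let it run for a second
    timer_delete(timerId);
    return 0;
}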
Next time, just do:
$ time ./proxy-application