I am trying to write a function that takes a desired frames-per-second value and a maximum frame count, and then prints to the console (with cout) on fixed time steps. I am using Sleep() to avoid busy waiting, but the program seems to sleep longer than it needs to; I think it keeps stalling on the Sleep call. Can you help me with this? I am having some trouble understanding timing, especially on Windows.
Ultimately I will probably use this timing method to time and animate a simple game, maybe like Pong, or even a simple program with objects that can accelerate. I think I already understand GDI and WASAPI well enough to play sound and show color on the screen, so now I need to understand timing. I searched for a long time before asking this question, and I am sure I am missing something, but I can't quite put my finger on it :(
Here is the code:
#include <windows.h>
#include <iostream>
// In this program I am trying to make a simple function that prints "frame: " and the frame number at fixed time intervals.
// I am trying to make it so that it doesn't busy-wait.
using namespace std;

void frame(LARGE_INTEGER& T, LARGE_INTEGER& T3, LARGE_INTEGER& DELT, LARGE_INTEGER& DESI, double& framepersec,
           unsigned long long& count, unsigned long long& maxcount, bool& on, LARGE_INTEGER& mili)
{
    QueryPerformanceCounter(&T3);                  // second measurement
    DELT.QuadPart = T3.QuadPart - T.QuadPart;      // ticks elapsed between the two measurements
    if (DELT.QuadPart >= DESI.QuadPart) {          // advancing the count by just one frame (this may cause problems if more than one passes)
        count++;
        cout << "frame: " << count << " !" << endl;
        T.QuadPart = T3.QuadPart;
    }
    if (count > maxcount) {
        on = false;                                // turning off the loop
    } else {
        DESI.QuadPart = T.QuadPart + DESI.QuadPart;                                // setting the stop tick
        unsigned long long sleep = (DESI.QuadPart - DELT.QuadPart) / mili.QuadPart;
        cout << sleep << endl;
        Sleep(sleep);                              // sleeping to avoid busy waiting
    }
}

int main()
{
    LARGE_INTEGER T1, T2, Freq, Delta, desired, mil;
    bool loopon = true;                        // keeps the loop running until the max frame count has been reached
    QueryPerformanceFrequency(&Freq);          // number of counter ticks per second
    mil.QuadPart = Freq.QuadPart / 1000;       // number of counter ticks in a millisecond
    double framespersec;                       // target frames per second
    unsigned long long framecount, maxcount;   // to stop the program after a certain number of frames
    framecount = 0;
    cout << "Hello world! enter the amount of frames per second : " << endl;
    cin >> framespersec;
    cout << "you entered: " << framespersec << " ! how many max frames?" << endl;
    cin >> maxcount;
    cout << "you entered: " << maxcount << " ! now doing the frames !!!" << endl;
    desired.QuadPart = (LONGLONG)(Freq.QuadPart / framespersec);   // counter ticks per target frame
    QueryPerformanceCounter(&T1);              // first measurement (without this, T1 is read uninitialized in frame())
    while (loopon) {
        frame(T1, T2, Delta, desired, framespersec, framecount, maxcount, loopon, mil);
    }
    cout << "all frames are done!" << endl;
    return 0;
}
The time that you sleep is limited by the frequency of the system clock. The frequency defaults to 64 Hz, so you'll end up seeing sleeps in increments of 16ms. Any sleep that's less than 16ms will be at least 16ms long - it could be longer depending on CPU load. Likewise, a sleep of 20ms will likely be rounded up to 32ms.
You can change this period by calling timeBeginPeriod(...) and timeEndPeriod(...), which can increase sleep accuracy to 1ms. If you have a look at multimedia apps like VLC Player, you'll see that they use these functions to get reliable frame timing. Note that this changes the system wide scheduling rate, so it will affect battery life on laptops.
More info:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd757624%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298%28v=vs.85%29.aspx
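Building on that suggestion, here is a minimal sketch of a frame loop that raises the timer resolution while pacing itself with Sleep. The 60 FPS target, the 300-frame count, and the sleep-then-short-spin pattern are my own choices for illustration, not part of the original code.
#include <windows.h>
#include <iostream>
#pragma comment(lib, "winmm.lib")   // timeBeginPeriod / timeEndPeriod live in winmm

int main()
{
    LARGE_INTEGER freq, prev, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&prev);

    const double targetFps = 60.0;                                  // assumed target, adjust as needed
    const LONGLONG ticksPerFrame = (LONGLONG)(freq.QuadPart / targetFps);

    timeBeginPeriod(1);                                             // request 1 ms scheduler granularity
    for (int frame = 0; frame < 300; ++frame)
    {
        LONGLONG deadline = prev.QuadPart + ticksPerFrame;
        QueryPerformanceCounter(&now);
        LONGLONG remaining = deadline - now.QuadPart;
        if (remaining > 0)
        {
            DWORD ms = (DWORD)(remaining * 1000 / freq.QuadPart);
            if (ms > 1) Sleep(ms - 1);                              // sleep most of the wait, leave a little slack
            do { QueryPerformanceCounter(&now); } while (now.QuadPart < deadline);   // short spin for the rest
        }
        prev.QuadPart = deadline;                                   // carry the deadline so error doesn't accumulate
        std::cout << "frame: " << frame << std::endl;
    }
    timeEndPeriod(1);                                               // always pair with timeBeginPeriod
    return 0;
}
Sleeping slightly less than the remaining time and spinning the last bit is a common compromise: Sleep handles the bulk of the wait without burning CPU, and the short spin absorbs the scheduler's remaining jitter.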
Waitable timers are more accurate than Sleep, and also integrate with a GUI message loop better (replace GetMessage with MsgWaitForMultipleObjects). I've used them successfully for graphics timing before.
They won't get you high precision for e.g. controlling serial or network output at sub-millisecond timing, but UI updates are limited by VSYNC anyway.
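For completeness, here is a minimal sketch of the waitable-timer approach, assuming a roughly 60 Hz period; in a real GUI app you would wait with MsgWaitForMultipleObjects inside the message loop instead of WaitForSingleObject, as noted above.
#include <windows.h>
#include <iostream>

int main()
{
    // Auto-reset waitable timer; becomes signaled each time the period elapses.
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);
    if (!timer) return 1;

    LARGE_INTEGER dueTime;
    dueTime.QuadPart = -166667;          // first fire in ~16.7 ms (negative = relative time, 100 ns units)
    const LONG periodMs = 16;            // then roughly every 16 ms (about 60 Hz)
    SetWaitableTimer(timer, &dueTime, periodMs, NULL, NULL, FALSE);

    for (int frame = 0; frame < 300; ++frame)
    {
        WaitForSingleObject(timer, INFINITE);   // blocks without busy waiting until the next tick
        std::cout << "frame: " << frame << std::endl;
    }

    CancelWaitableTimer(timer);
    CloseHandle(timer);
    return 0;
}
The period is an integer number of milliseconds and its accuracy is still tied to the system timer resolution, so this is often combined with timeBeginPeriod(1).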
Related
I recently started experimenting with std::thread, and I tried running a small program that displays the webcam feed in a separate thread, using OpenCV. I am just doing this for "educational" purposes. What I noticed is that the thread seems to keep jumping between cores, which struck me as odd, since I thought the overhead of such a migration would not be worth it from an efficiency/performance point of view. Does anybody know the reason for such behavior?
Short disclaimer: I am new to Stack Overflow, so if I missed something, please let me know.
A snapshot of my system monitor - Ubuntu
#include <stdio.h>
#include <opencv2/opencv.hpp> // OpenCV functionality
#include <time.h>             // timing functionality
#include <thread>

using namespace cv;
using namespace std;

void webcam_func()
{
    Mat image;
    namedWindow("Display window");
    VideoCapture cap(0);
    if (!cap.set(CAP_PROP_AUTO_EXPOSURE, 10)) {
        std::cout << "Exposure could not be set!" << std::endl;
        //return -1;
    }
    if (!cap.isOpened()) {
        cout << "cannot open camera";
    }
    int i = 0;
    while (i < 1000000) {
        cap >> image;
        Size s = image.size();
        int rows = s.height;
        int cols = s.width;
        imshow("Display window", image);
        double fps = cap.get(CAP_PROP_FPS);
        //cout << "Frames per second using video.get(CAP_PROP_FPS) : " << fps << endl;
        //cout << "The height of the video is " << rows << endl;
        //cout << "The width of the video is " << cols << endl;
        std::thread::id this_id = std::this_thread::get_id();
        std::cout << "thread id --> " << this_id << std::endl;
        waitKey(25);
        i++;
        std::cout << "Counter value " << i << std::endl;
    }
}

int main()
{
    std::thread t1(webcam_func);
    while (true) {
        // busy-waits forever; t1.join() would be the idiomatic way to wait for the thread
    }
    return 0;
}
The default Linux scheduler schedules tasks (e.g. threads) for a given quantum (time slice) on the available processing units (e.g. cores or hardware threads). This quantum can be cut short if a task goes to sleep or waits for something (input, locks, etc.). waitKey(25) does exactly that: it causes your thread to wait for a short period of time. The thread's execution is interrupted, a context switch occurs, and the OS can execute other tasks during this time. When the computing thread is ready again (because more than 25 ms have elapsed), the scheduler can schedule it again. It tries to execute the task on the same processing unit to reduce overheads (e.g. cache misses), but that processing unit may still be in use by another thread when the computing task is scheduled back. This is unlikely when there are few ready tasks, or only greedy ones, though.
Additionally, some processors support SMT (aka hyper-threading). For example, many x86-64 Intel processors support 2 hardware threads per core sharing the same caches, and context switches between 2 hardware threads on the same core are significantly cheaper (e.g. far fewer cache misses). Also note that the Linux scheduler is not perfect, like most other schedulers. In fact, it was buggy a few years ago and not even able to fill all available cores when that was possible (see: The Linux Scheduler: a Decade of Wasted Cores). Finally, note that the (direct) overhead of a context switch is no more than a few dozen microseconds on a mainstream Linux PC, so having one every few dozen milliseconds is fine (<1% overhead).
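If you want to see whether the migrations matter, or simply stop them, you can pin the thread to one core. This is a Linux-specific sketch using pthread_setaffinity_np on the std::thread's native handle; worker() is just a stand-in for the webcam loop from the question, and the choice of CPU 0 is arbitrary. Compile with g++ -pthread.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <chrono>
#include <iostream>

// Placeholder for the webcam loop from the question: just reports which CPU it runs on.
void worker()
{
    for (int i = 0; i < 20; ++i) {
        std::cout << "running on CPU " << sched_getcpu() << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(25));   // plays the same role as waitKey(25)
    }
}

int main()
{
    std::thread t1(worker);

    // Restrict the thread to CPU 0 so the scheduler cannot migrate it.
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    int rc = pthread_setaffinity_np(t1.native_handle(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0)
        std::cerr << "pthread_setaffinity_np failed: " << rc << std::endl;

    t1.join();   // wait for the thread instead of spinning in an empty loop
    return 0;
}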
I have been looking at the performance of our C++ server application running on embedded Linux (ARM). The pseudo code for the main processing loop of the server is this -
for i = 1 to 1000
Process item i
Sleep for 20 ms
The processing for one item takes about 2ms. The "Sleep" here is really a call to the Poco library to do a "tryWait" on an event. If the event is fired (which it never is in my tests) or the time expires, it returns. I don't know what system call this equates to. Although we ask for a 2ms block, it turns out to be roughly 20ms. I can live with that - that's not the problem. The sleep is just an artificial delay so that other threads in the process are not starved.
The loop takes about 24 seconds to go through 1000 items.
The problem is, we changed the way the sleep is used so that we had a bit more control. I mean - 20ms delay for 2ms processing doesn't allow us to do much processing. With this new parameter set to a certain value it does something like this -
For i = 1 to 1000
Process item i
if i % 50 == 0 then sleep for 1000ms
That's the rough code; in reality the number of sleeps is slightly different, and it happens to work out at a 24s cycle to get through all the items - just as before.
So we are doing exactly the same amount of processing in the same amount of time.
Problem 1 - the CPU usage for the original code is reported at around 1% (it varies a little but that's about average) and the CPU usage reported for the new code is about 5%. I think they should be the same.
Well perhaps this CPU reporting isn't accurate so I thought I'd sort a large text file at the same time and see how much it's slowed up by our server. This is a CPU bound process (98% CPU usage according to top). The results are very odd. With the old code, the time taken to sort the file goes up by 21% when our server is running.
Problem 2 - If the server is only using 1% of the CPU then wouldn't the time taken to do the sort be pretty much the same?
Also, the time taken to go through all the items doesn't change - it's still 24 seconds with or without the sort running.
Then I tried the new code; it only slows the sort down by about 12%, but it now takes about 40% longer to get through all the items it has to process.
Problem 3 - Why do the two ways of introducing an artificial delay cause such different results? It seems that the server which sleeps more frequently, but for a minimum time, is getting more priority.
I have a half-baked theory on the last one - whatever system call is used to do the "sleep" switches back to the server process when the time has elapsed. This gives the process another bite at the time slice on a regular basis.
Any help appreciated. I suspect I'm just not understanding it correctly and that things are more complicated than I thought. I can provide more details if required.
Thanks.
Update: replaced tryWait(2) with usleep(2000) - no change. In fact, sched_yield() does the same.
Well I can at least answer problem 1 and problem 2 (as they are the same issue).
After trying out various options in the actual server code, we came to the conclusion that the CPU reporting from the OS is incorrect. It's quite a surprising result, so to make sure, I wrote a stand-alone program that doesn't use Poco or any of our code - just plain Linux system calls and standard C++ features. It implements the pseudocode above. The processing is replaced with a tight loop that just checks the elapsed time to see if 2ms is up. The sleeps are proper sleeps.
The small test program shows exactly the same problem: doing the same amount of processing but splitting up the sleeps differently produces very different results for CPU usage. In the case of the test program, the reported CPU usage was 0.0078 seconds using 1000 20ms sleeps, but 1.96875 seconds when the less frequent 1000ms sleeps were used. The amount of processing done is the same.
Running the test on a Linux PC did not show the problem. Both ways of sleeping produced exactly the same CPU usage.
So it's clearly a problem with our embedded system and the way it measures CPU time when a process yields so often (you get the same problem with sched_yield instead of a sleep).
Update: Here's the code. RunLoop is where the main bit is done -
// Includes needed to build this test program (not shown in the original listing):
#include <iostream>
#include <time.h>       // clock_gettime
#include <sys/time.h>   // gettimeofday
#include <unistd.h>     // usleep

int sleepCount;

// Process CPU time in seconds, via CLOCK_PROCESS_CPUTIME_ID.
double getCPUTime()
{
    clockid_t id = CLOCK_PROCESS_CPUTIME_ID;
    struct timespec ts;
    if (id != (clockid_t)-1 && clock_gettime(id, &ts) != -1)
        return (double)ts.tv_sec + (double)ts.tv_nsec / 1000000000.0;
    return -1;
}

double GetElapsedMilliseconds(const timeval& startTime)
{
    timeval endTime;
    gettimeofday(&endTime, NULL);
    double elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0;   // sec to ms
    elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0;       // us to ms
    return elapsedTime;
}

void SleepMilliseconds(int milliseconds)
{
    timeval startTime;
    gettimeofday(&startTime, NULL);
    usleep(milliseconds * 1000);
    double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
    if (elapsedMilliseconds > milliseconds + 0.3)
        std::cout << "Sleep took longer than it should " << elapsedMilliseconds;
    sleepCount++;
}

// "Processing" is a busy loop that spins for 2 ms of wall time.
void DoSomeProcessingForAnItem()
{
    timeval startTime;
    gettimeofday(&startTime, NULL);
    double processingTimeMilliseconds = 2.0;
    double elapsedMilliseconds;
    do
    {
        elapsedMilliseconds = GetElapsedMilliseconds(startTime);
    } while (elapsedMilliseconds <= processingTimeMilliseconds);
    if (elapsedMilliseconds > processingTimeMilliseconds + 0.1)
        std::cout << "Processing took longer than it should " << elapsedMilliseconds;
}

void RunLoop(bool longSleep)
{
    int numberOfItems = 1000;
    timeval startTime;
    gettimeofday(&startTime, NULL);
    timeval startMainLoopTime;
    gettimeofday(&startMainLoopTime, NULL);
    for (int i = 0; i < numberOfItems; i++)
    {
        DoSomeProcessingForAnItem();
        double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
        if (elapsedMilliseconds > 100)
        {
            std::cout << "Item count = " << i << "\n";
            if (longSleep)
            {
                SleepMilliseconds(1000);
            }
            gettimeofday(&startTime, NULL);
        }
        if (longSleep == false)
        {
            // Does 1000 * 20 ms sleeps.
            SleepMilliseconds(20);
        }
    }
    double elapsedMilliseconds = GetElapsedMilliseconds(startMainLoopTime);
    std::cout << "Main loop took " << elapsedMilliseconds / 1000 << " seconds\n";
}

void DoTest(bool longSleep)
{
    timeval startTime;
    gettimeofday(&startTime, NULL);
    double startCPUtime = getCPUTime();
    sleepCount = 0;
    int runLoopCount = 1;
    for (int i = 0; i < runLoopCount; i++)
    {
        RunLoop(longSleep);
        std::cout << "**** Done one loop of processing ****\n";
    }
    double endCPUtime = getCPUTime();
    std::cout << "Elapsed time is " << GetElapsedMilliseconds(startTime) / 1000 << " seconds\n";
    std::cout << "CPU time used is " << endCPUtime - startCPUtime << " seconds\n";
    std::cout << "Sleep count " << sleepCount << "\n";
}

void testLong()
{
    std::cout << "Running testLong\n";
    DoTest(true);
}

void testShort()
{
    std::cout << "Running testShort\n";
    DoTest(false);
}
I am doing a benchmark project comparing two graphics libraries (SDL and SFML) for my final CS project. I have it almost finished, but when I benchmark the speed of playing sounds, it always reports a time of 0, no matter how many loops it does. Do you know what's wrong with my code? The sound actually plays, but I should probably use some other approach.
void playSound()
{
    Mix_PlayChannel(-1, sound, 0);
}

void soundBenchmark(int numOfCycles)
{
    int time = SDL_GetTicks(), timeRequired;
    for(int i = 0; i < numOfCycles; i++) playSound();
    timeRequired = SDL_GetTicks() - time;
    // SDL_GetTicks() reports milliseconds, not seconds
    cout << "Time required for " << numOfCycles << " cycles: " << timeRequired << " milliseconds.\n";
}
The function Mix_PlayChannel() does not block the execution of your code. It just hands the data to the sound card (or equivalent) and returns.
You are going to have to remember the channel that Mix_PlayChannel() returns and then periodically check with Mix_Playing() whether that channel is still playing, and look at the elapsed time when it stops.
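A minimal sketch of that idea, measuring how long one playback actually takes (the 10 ms polling interval is an arbitrary choice):
#include <SDL.h>
#include <SDL_mixer.h>
#include <iostream>

// Plays 'sound' once and returns how long the playback took, in milliseconds.
Uint32 timePlayback(Mix_Chunk* sound)
{
    Uint32 start = SDL_GetTicks();
    int channel = Mix_PlayChannel(-1, sound, 0);   // remember which channel was used
    if (channel == -1) {
        std::cout << "Mix_PlayChannel failed: " << Mix_GetError() << "\n";
        return 0;
    }
    while (Mix_Playing(channel)) {                 // poll until the channel goes quiet
        SDL_Delay(10);
    }
    return SDL_GetTicks() - start;
}
Note that the measured time is essentially the length of the sound plus up to one polling interval, not the cost of the Mix_PlayChannel call itself.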
Is there any way in C++ to calculate how long it takes to run a given program or routine in CPU time?
I work with Visual Studio 2008 running on Windows 7.
If you want to know the total amount of CPU time used by a process, neither clock nor rdtsc (either directly or via a compiler intrinsic) is really the best choice, at least IMO. If you need the code to be portable, about the best you can do is use clock, test with the system as quiescent as possible, and hope for the best (but if you do, be aware that clock ticks CLOCKS_PER_SEC times per second, which may or may not be 1000, and even if it is, your actual timing resolution often won't be that good -- it may give you times in milliseconds, but will normally advance tens of milliseconds at a time).
Since, however, you don't seem to mind the code being specific to Windows, you can do quite a bit better. At least if my understanding of what you're looking for is correct, what you really want is probably GetProcessTimes, which will (separately) tell you both the kernel-mode and user-mode CPU usage of the process (as well as the start time and exit time, from which you can compute wall time used, if you care). There's also QueryProcessCycleTime, which will tell you the total number of CPU clock cycles used by the process (the total of both user and kernel mode across all threads). Personally, I have a hard time imagining much use for the latter, though -- counting individual clock cycles can be useful for small sections of code subject to intensive optimization, but I'm less certain how you'd apply it to a complete process. GetProcessTimes uses FILETIME structures, which support resolutions of 100 nanoseconds, but in reality most times you'll see will be multiples of the scheduler's time slice (which varies with the version of Windows, but is on the order of milliseconds to tens of milliseconds).
In any case, if you truly want time from beginning to end, GetProcessTimes will let you do that -- if you spawn the program (e.g., with CreateProcess), you'll get a handle to the process which will be signaled when the child process exits. You can then call GetProcessTimes on that handle, and retrieve the times even though the child has already exited -- the handle will remain valid as long as at least one handle to the process remains open.
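Here is a rough, Windows-specific sketch of that approach; the child command line "child.exe" is just a placeholder for the program you want to measure.
#include <windows.h>
#include <iostream>

// Converts a FILETIME duration (100 ns units) to seconds.
static double FileTimeToSeconds(const FILETIME& ft)
{
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 10000000.0;
}

int main()
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    char cmdLine[] = "child.exe";                 // placeholder: the program you want to measure

    if (!CreateProcessA(NULL, cmdLine, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        std::cerr << "CreateProcess failed: " << GetLastError() << "\n";
        return 1;
    }

    WaitForSingleObject(pi.hProcess, INFINITE);   // wait for the child to exit

    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (GetProcessTimes(pi.hProcess, &creationTime, &exitTime, &kernelTime, &userTime)) {
        std::cout << "kernel CPU: " << FileTimeToSeconds(kernelTime) << " s\n";
        std::cout << "user CPU:   " << FileTimeToSeconds(userTime) << " s\n";
    }

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}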
Here's one way. It measures routine execution time in milliseconds.
clock_t begin=clock(); is taken before the routine runs, and clock_t end=clock(); right after it returns.
The two readings are then subtracted and the difference converted to a millisecond value.
#include <stdio.h>
#include <iostream>
#include <time.h>

using namespace std;

double get_CPU_time_usage(clock_t clock1, clock_t clock2)
{
    double diffticks = clock1 - clock2;
    double diffms = (diffticks * 1000) / CLOCKS_PER_SEC;
    return diffms;
}

void test_CPU_usage()
{
    cout << "Standby.. measuring execution time: ";
    for (int i = 0; i < 10000; i++)
    {
        cout << "\b\\" << std::flush;
        cout << "\b|" << std::flush;
        cout << "\b/" << std::flush;
        cout << "\b-" << std::flush;
    }
    cout << " \n\n";
}

int main(void)
{
    clock_t begin = clock();
    test_CPU_usage();
    clock_t end = clock();
    cout << "Time elapsed: " << double(get_CPU_time_usage(end, begin)) << " ms ("
         << double(get_CPU_time_usage(end, begin)) / 1000 << " sec) \n\n";
    return 0;
}
The __rdtscp intrinsic will give you the time in CPU cycles with some caveats.
Here's the MSDN article
It really depends on what you want to measure. For better results, take the average over a few million (if not a billion) iterations.
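A minimal sketch of using the intrinsic with a recent MSVC to time a small code region; the loop body here is just a stand-in for whatever you want to measure, and the result is in CPU cycles, which only maps cleanly to wall time if the TSC on your CPU is invariant.
#include <intrin.h>
#include <iostream>

int main()
{
    unsigned int aux;                       // receives the processor ID, usually ignored
    volatile long long sink = 0;            // keeps the loop from being optimized away

    unsigned __int64 start = __rdtscp(&aux);
    for (int i = 0; i < 1000000; ++i)
        sink += i;                          // stand-in for the routine being measured
    unsigned __int64 end = __rdtscp(&aux);

    std::cout << "cycles: " << (end - start) << "\n";
    return 0;
}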
The clock() function [as provided by Visual C++ 2008] doesn't return processor time used by the program, while it should (according to the C standard and/or C++ standard). That said, to measure CPU time on Windows, I have this helper class (which is inevitably non-portable):
#include <windows.h>
#include <iostream>
#include <tuple>

class ProcessorTimer
{
public:
    ProcessorTimer() { start(); }
    void start() { ::GetProcessTimes(::GetCurrentProcess(), &ft_[3], &ft_[2], &ft_[1], &ft_[0]); }
    std::tuple<double, double> stop()
    {
        ::GetProcessTimes(::GetCurrentProcess(), &ft_[5], &ft_[4], &ft_[3], &ft_[2]);
        ULARGE_INTEGER u[4];
        for (size_t i = 0; i < 4; ++i)
        {
            u[i].LowPart = ft_[i].dwLowDateTime;
            u[i].HighPart = ft_[i].dwHighDateTime;
        }
        double user = (u[2].QuadPart - u[0].QuadPart) / 10000000.0;
        double kernel = (u[3].QuadPart - u[1].QuadPart) / 10000000.0;
        return std::make_tuple(user, kernel);
    }
private:
    FILETIME ft_[6];
};

class ScopedProcessorTimer
{
public:
    ScopedProcessorTimer(std::ostream& os = std::cerr) : timer_(ProcessorTimer()), os_(os) { }
    ~ScopedProcessorTimer()
    {
        std::tuple<double, double> t = timer_.stop();
        os_ << "user " << std::get<0>(t) << "\n";
        os_ << "kernel " << std::get<1>(t) << "\n";
    }
private:
    ProcessorTimer timer_;
    std::ostream& os_;
};
For example, one can measure how long it takes a block to execute, by defining a ScopedProcessorTimer at the beginning of that {} block.
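For example (the work inside the function is just a placeholder):
void some_routine()
{
    ScopedProcessorTimer timer;          // reports user/kernel CPU seconds to std::cerr when the block exits
    for (volatile int i = 0; i < 10000000; ++i)
        ;                                // placeholder for the code being measured
}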
This code measures the CPU usage of a process:
// Note: nResult and Process::pid come from the poster's surrounding code.
ULONGLONG LastCycleTime = 0;
LARGE_INTEGER LastPCounter;
LastPCounter.QuadPart = 0;   // LARGE_INTEGER init

// get the number of processor cores
SYSTEM_INFO sysInfo;
GetSystemInfo(&sysInfo);
int numProcessors = sysInfo.dwNumberOfProcessors;

HANDLE hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, Process::pid);
if (hProcess == NULL)
    nResult = 0;

int count = 0;
while (true)
{
    ULONG64 CycleTime;
    LARGE_INTEGER qpcLastInt;

    if (!QueryProcessCycleTime(hProcess, &CycleTime))
        nResult = 0;

    ULONG64 cycle = CycleTime - LastCycleTime;

    if (!QueryPerformanceCounter(&qpcLastInt))
        nResult = 0;

    double Usage = cycle / ((double)(qpcLastInt.QuadPart - LastPCounter.QuadPart));

    // scaling
    Usage *= 1.0 / numProcessors;
    Usage *= 0.1;

    LastPCounter = qpcLastInt;
    LastCycleTime = CycleTime;

    if (count > 3)
    {
        printf("%.1f", Usage);
        break;
    }

    Sleep(1);   // QueryPerformanceCounter resolution is about a microsecond or better
    count++;
}
CloseHandle(hProcess);
I'm currently trying to code a certain dynamic programming approach for a vehicle routing problem. At a certain point, I have a partial route that I want to add to a min-max heap in order to keep the best 100 partial routes at the same stage. Most of the program runs smoothly, but when I actually want to insert a partial route into the heap, things tend to go a bit slow. That particular code is shown below:
clock_t insert_start, insert_finish, check1_finish, check2_finish;
insert_start = clock();
check2_finish = clock();

if (heap.get_vector_size() < 100) {
    check1_finish = clock();
    heap.insert(expansion);
    cout << "node added" << endl;
}
else {
    check1_finish = clock();
    if (expansion.get_cost() < heap.find_max().get_cost()) {
        check2_finish = clock();
        heap.delete_max();
        heap.insert(expansion);
        cout << "worst node deleted and better one added" << endl;
    }
    else {
        check2_finish = clock();
        cout << "cost too high check" << endl;
    }
}

number_expansions++;

cout << "check 1 takes " << check1_finish - insert_start << " ms" << endl;
cout << "check 2 takes " << check2_finish - check1_finish << "ms " << endl;
insert_finish = clock();
cout << "Inserting an expanded state into the heap takes " << insert_finish - insert_start << " clocks" << endl;
A typical output is this:
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 16 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
I know it's hard to say something about the code when this block uses functions that are implemented elsewhere but I'm flabbergasted as to why this sometimes takes less than a ms and sometimes takes up to 16 ms. The program should execute this block thousands of times so these small hiccups are really slowing things down enormously.
My only guess is that something happens with the vector in the heap class that stores all these states, but I reserve space for 100 items in the constructor using vector::reserve, so I don't see how this could still be a problem.
Thanks!
Preempting. Your program may be preempted by the operating system, so some other program can run for a bit.
Also, it's not 16 ms. It's 16 clock ticks: http://www.cplusplus.com/reference/clibrary/ctime/clock/
If you want ms, you need to do:
cout << "Inserting an expanded state into the heap takes "
<< (insert_finish - insert_start) * 1000 / CLOCKS_PER_SEC
<< " ms " << endl;
Finally, you're setting insert_finish after printing out the other results. Try setting it immediately after your if/else block. The cout command is a good time to get preempted by another process.
My only guess is that something happens with the vector in the heap class that stores all these states but I reserve place for a 100 items in the constructor using vector::reserve so I don't see how this could still be a problem.
Are you using std::vector to implement it? Insertion into the middle of a std::vector takes linear time. Deleting the max can also take time if you are not using a sorted container.
I suggest using a std::set or std::multiset instead. Insert, erase and find are always O(log n).
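A rough sketch of that multiset idea, keeping only the 100 cheapest partial routes; Route and get_cost() are hypothetical stand-ins for the poster's own types.
#include <set>
#include <iterator>

// Hypothetical stand-in for the poster's partial-route type.
struct Route {
    double cost;
    double get_cost() const { return cost; }
};

struct ByCost {
    bool operator()(const Route& a, const Route& b) const { return a.get_cost() < b.get_cost(); }
};

// Keeps only the 100 cheapest routes seen so far.
void keepBest(std::multiset<Route, ByCost>& best, const Route& expansion)
{
    if (best.size() < 100) {
        best.insert(expansion);                                   // O(log n)
    } else if (expansion.get_cost() < std::prev(best.end())->get_cost()) {
        best.erase(std::prev(best.end()));                        // drop the current worst
        best.insert(expansion);                                   // insert the better route
    }
}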
Try measuring the time with QueryPerformanceCounter, because the clock function may not be very accurate. clock probably has the same granularity as the Windows scheduler tick - 10 ms on a single-CPU machine and 15 or 16 ms on a multicore CPU. QueryPerformanceCounter together with QueryPerformanceFrequency can give you sub-microsecond resolution.
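A minimal sketch of timing a block with QueryPerformanceCounter; the heap insertion is represented by a placeholder comment.
#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);              // counter ticks per second

    QueryPerformanceCounter(&start);
    // heap.insert(expansion);                     // placeholder for the code being timed
    QueryPerformanceCounter(&stop);

    double elapsedMs = (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    std::cout << "Insert took " << elapsedMs << " ms\n";
    return 0;
}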
It looks like you are measuring "wall time", not CPU time. Windows itself is not a real-time OS; occasional large hiccups from high-priority things like device drivers are not at all uncommon.
On Windows, if I'm manually looking for bottlenecks in code, I use RDTSC instead. Even better would be to not do it manually but use a profiler.