Busy Loop/Spinning sometimes takes too long under Windows

I'm using a Windows 7 PC to output voltages at a rate of 1 kHz. At first I simply ended each iteration of the thread with sleep_until(nextStartTime); however, this has proven to be unreliable, sometimes working fine and sometimes being off by up to 10 ms.
I found other answers here saying that a busy loop might be more accurate, however mine for some reason also sometimes takes too long.
while (true) {
    doStuff(); //is quick enough
    logDelays();
    nextStartTime = chrono::high_resolution_clock::now() + chrono::milliseconds(1);
    spinStart = chrono::high_resolution_clock::now();
    while (chrono::duration_cast<chrono::microseconds>(nextStartTime -
            chrono::high_resolution_clock::now()).count() > 200) {
        spinCount++; //a volatile int
    }
    int spintime = chrono::duration_cast<chrono::microseconds>
        (chrono::high_resolution_clock::now() - spinStart).count();
    cout << "Spin Time micros :" << spintime << endl;
    if (spinCount > 100000000) {
        cout << "reset spincount" << endl;
        spinCount = 0;
    }
}
I was hoping that this would work to fix my issue, however it produces the output:
Spin Time micros :9999
Spin Time micros :9999
...
I've been stuck on this problem for the last 5 hours and I'd be very thankful if somebody knows a solution.

According to the comments, this code waits correctly:
auto start = std::chrono::high_resolution_clock::now();
const auto delay = std::chrono::milliseconds(1);
while (true) {
    doStuff(); //is quick enough
    logDelays();
    auto spinStart = std::chrono::high_resolution_clock::now();
    // busy-wait until one full delay has passed since the scheduled start
    while (std::chrono::high_resolution_clock::now() < start + delay) {}
    int spintime = std::chrono::duration_cast<std::chrono::microseconds>
        (std::chrono::high_resolution_clock::now() - spinStart).count();
    std::cout << "Spin Time micros :" << spintime << std::endl;
    start += delay;
}
The important part is the combination of the busy-wait while (std::chrono::high_resolution_clock::now() < start + delay) {} and start += delay;. Because start advances by exactly delay on every iteration, regardless of when the loop actually woke up, the schedule does not drift even when outside factors (Windows Update keeping the system busy) disturb it. If an iteration takes longer than delay, the following iterations run without waiting until the loop catches up (which may be never if doStuff is sufficiently slow).
Note that missing an update (due to the system being busy) and then sending 2 at once to catch up might not be the best way to handle the situation. You may want to check the current time inside doStuff and abort/restart the transmission if the timing is wrong by more than some acceptable amount.
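A minimal sketch of the alternative mentioned above, in which periods that were missed are skipped rather than replayed in a burst. The helper names spin_until and advance_deadline are hypothetical and not part of the original code, and steady_clock is used on the assumption that a monotonic clock is the safer choice for deadlines.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Busy-wait until the given absolute deadline has been reached.
inline void spin_until(Clock::time_point deadline) {
    while (Clock::now() < deadline) {
        // intentionally empty: spinning trades CPU time for low wake-up latency
    }
}

// Hypothetical helper: returns the first deadline after `now` on the grid
// previous + k * period (k >= 1). `skipped` receives the number of periods
// that were already over and are dropped instead of being replayed.
inline Clock::time_point advance_deadline(Clock::time_point previous,
                                          Clock::duration period,
                                          Clock::time_point now,
                                          long long& skipped) {
    assert(period > Clock::duration::zero());
    Clock::time_point next = previous + period;
    skipped = 0;
    if (now >= next) {
        skipped = (now - next) / period + 1;  // whole periods already missed
        next += period * skipped;
    }
    return next;
}
```

Inside the 1 kHz loop this would be used as deadline = advance_deadline(deadline, std::chrono::milliseconds(1), Clock::now(), skipped); followed by spin_until(deadline);. A non-zero skipped value is the point at which the caller could log the overrun or restart the transmission.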

On Windows I don't think it's possible to ever get such precise timing, because you cannot guarantee your thread is actually running at the time you desire. Even with low CPU usage and your thread set to real-time priority, it can still be interrupted (by hardware interrupts, as I understand it. I never fully investigated, but I've seen even a simple while(true) ++i; type loop at real-time priority get interrupted and then moved between CPU cores). While such interrupts and switching for a real-time thread are very quick, they are still significant if you're trying to directly drive a signal without buffering.
Instead you really want to read and write buffers of digital samples (so at 1 kHz each sample is 1 ms). You need to be sure to queue another buffer before the last one is completed, which will constrain how small they can be, but at 1 kHz at real-time priority, if the code is simple and there is no other CPU contention, a single-sample buffer (1 ms) might even be possible, which is at worst 1 ms extra latency over "immediate", but you would have to test. You then leave it up to the hardware and its drivers to handle the precise timing (e.g. make sure each output sample is "exactly" 1 ms to the accuracy the vendor claims).
This basically means your code only has to be accurate to 1 ms in the worst case, rather than trying to pursue something far smaller than the OS really supports, such as microsecond accuracy.
As long as you are able to queue a new buffer before the hardware has used up the previous buffer, it will be able to run at the desired frequency without issue (to use audio as an example again: while the tolerated latencies, and thus the buffers, are often much larger, if you overload the CPU you can still sometimes hear audible glitches where an application didn't queue up new raw audio in time).
With careful timing you might even be able to get down to a fraction of a millisecond by waiting to process and queue your next sample as long as possible (e.g. if you need to reduce latency between input and output), but remember that the closer you cut it the more you risk submitting it too late.
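The trade-off between buffer size and latency described above comes down to simple arithmetic. The sketch below only illustrates that arithmetic. The function names are hypothetical, and no real device or driver API is involved.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Playback time covered by one buffer: samples / sample_rate, in microseconds.
inline std::chrono::microseconds buffer_duration(std::uint32_t samples,
                                                 std::uint32_t sample_rate_hz) {
    assert(sample_rate_hz > 0);
    return std::chrono::microseconds(
        static_cast<std::int64_t>(samples) * 1000000 / sample_rate_hz);
}

// Time left over when preparing and queueing the next buffer takes `prepare`.
// A negative result means the buffer would be submitted too late.
inline std::chrono::microseconds queue_slack(std::uint32_t samples,
                                             std::uint32_t sample_rate_hz,
                                             std::chrono::microseconds prepare) {
    return buffer_duration(samples, sample_rate_hz) - prepare;
}
```

At 1 kHz a single-sample buffer covers 1000 microseconds, so code that needs 300 microseconds to prepare the next sample has 700 microseconds of slack. An 8-sample buffer raises the slack but also adds up to 8 ms of latency.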

Related

C++ How to make precise frame rate limit?

I'm trying to create a game using C++ and I want to limit the fps, but I always get more or less fps than I want. When I look at games that have an fps limit, the framerate is always precise. I tried using Sleep() and std::this_thread::sleep_for (and sleep_until). For example, Sleep(0.01-deltaTime) to get 100 fps ended up with roughly 90 fps.
How do these games handle fps so precisely when any sleeping isn't precise?
I know I can use an infinite loop that just checks whether the time has passed, but that uses the full power of the CPU, and I want this limit to decrease CPU usage without VSync.

Yes, sleep is usually inaccurate. That is why you sleep for less than the actual time it takes to finish the frame. For example, if you need 5 more milliseconds to finish the frame, then sleep for 4 milliseconds. After the sleep, simply spin (busy-wait) for the rest of the frame. Something like
float TimeRemaining = NextFrameTime - GetCurrentTime();
Sleep(ConvertToMilliseconds(TimeRemaining) - 1);
while (GetCurrentTime() < NextFrameTime) {};
Edit: as stated in another answer, timeBeginPeriod() should be called to increase the accuracy of Sleep(). Also, from what I've read, Windows will automatically call timeEndPeriod() when your process exits if you don't before then.
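The sleep-then-spin idea above can be sketched portably with std::chrono. This is only an illustration under assumptions: hybrid_wait_until is a hypothetical name, and the 1 ms default margin is a guess that would need tuning against the actual sleep accuracy of the target system.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Sleep for most of the wait, then busy-wait for the last `spin_margin`
// so that an oversleeping scheduler does not push us past the deadline.
inline void hybrid_wait_until(Clock::time_point deadline,
                              Clock::duration spin_margin = std::chrono::milliseconds(1)) {
    assert(spin_margin >= Clock::duration::zero());
    const auto now = Clock::now();
    if (deadline - now > spin_margin) {
        // coarse part: give the CPU back to the OS
        std::this_thread::sleep_until(deadline - spin_margin);
    }
    while (Clock::now() < deadline) {
        // fine part: spin until the exact deadline
    }
}
```

A frame loop would keep an absolute next_frame time point, call hybrid_wait_until(next_frame) and then add the frame duration to next_frame. If the sleep regularly overshoots the margin, the margin is too small for that machine.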

You could record the time point when you start, add a fixed duration to it and sleep until the calculated time point occurs at the end (or beginning) of every loop. Example:
#include <chrono>
#include <iostream>
#include <ratio>
#include <thread>
template<std::intmax_t FPS>
class frame_rater {
public:
    frame_rater() : // initialize the object keeping the pace
        time_between_frames{1}, // std::ratio<1, FPS> seconds
        tp{std::chrono::steady_clock::now()}
    {}
    void sleep() {
        // add to time point
        tp += time_between_frames;
        // and sleep until that time point
        std::this_thread::sleep_until(tp);
    }
private:
    // a duration with a length of 1/FPS seconds
    std::chrono::duration<double, std::ratio<1, FPS>> time_between_frames;
    // the time point we'll add to in every loop
    std::chrono::time_point<std::chrono::steady_clock, decltype(time_between_frames)> tp;
};
// this should print ~10 times per second pretty accurately
int main() {
    frame_rater<10> fr; // 10 FPS
    while(true) {
        std::cout << "Hello world\n";
        fr.sleep(); // let it sleep any time remaining
    }
}

The accepted answer sounds really bad. It would not be accurate and it would burn the CPU!
Sleep is not accurate unless you tell it to be: by default its granularity is about 15 ms, which means that if you tell it to sleep for 1 ms it could sleep for 15 ms.
You can change this with the Win32 API functions timeBeginPeriod and timeEndPeriod.
Check MSDN for more details -> https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
(I would comment on the accepted answer, but I don't have 50 reputation yet.)
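A minimal sketch of how the timeBeginPeriod / timeEndPeriod pair might be wrapped so that every request is matched by a release. The class name TimerResolutionGuard is hypothetical. The Windows calls are compiled only on Windows, so elsewhere the guard does nothing.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#include <mmsystem.h>  // timeBeginPeriod / timeEndPeriod; link with winmm.lib
#endif

// RAII guard: requests a timer resolution for as long as the object lives.
class TimerResolutionGuard {
public:
    explicit TimerResolutionGuard(unsigned period_ms) : period_ms_(period_ms) {
        assert(period_ms > 0);
#ifdef _WIN32
        active_ = (timeBeginPeriod(period_ms_) == TIMERR_NOERROR);
#endif
    }
    ~TimerResolutionGuard() {
#ifdef _WIN32
        // timeEndPeriod must be called with the same value as timeBeginPeriod
        if (active_) timeEndPeriod(period_ms_);
#endif
    }
    TimerResolutionGuard(const TimerResolutionGuard&) = delete;
    TimerResolutionGuard& operator=(const TimerResolutionGuard&) = delete;

    bool active() const { return active_; }        // false if the request failed or not on Windows
    unsigned period_ms() const { return period_ms_; }

private:
    unsigned period_ms_;
    bool active_ = false;
};
```

Creating TimerResolutionGuard guard(1); at the start of the timing-sensitive section and letting it go out of scope afterwards keeps the higher resolution, and its power cost, limited to where it is needed.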

Be very careful when implementing any wait that is based on scheduler sleep.
Most OS schedulers have higher latency turn-around for a wait with no well-defined interval or signal to bring the thread back into the ready-to-run state.
Sleeping isn't inaccurate per se; you're just approaching the problem all wrong. If you have access to something like DXGI's Waitable Swapchain, you synchronize to the DWM's present queue and get really reliable low-latency timing.
You don't need to busy-wait to get accurate timing, a waitable timer will give you a sync object to reschedule your thread.
Whatever you do, do not use the currently accepted answer in production code. There's an edge case here you WANT TO AVOID, where Sleep (0) does not yield CPU time to higher priority threads. I've seen so many game devs try Sleep (0) and it's going to cause you major problems.
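A rough sketch of waiting on a waitable timer instead of spinning. The function name timer_wait is hypothetical. The Windows branch uses CreateWaitableTimerW, SetWaitableTimer and WaitForSingleObject, and the sleep_for fallback exists only so the sketch also builds on other platforms.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#ifdef _WIN32
#include <windows.h>
#endif

// Blocks for roughly `d` without spinning.
// Returns false if the timer could not be created or armed.
inline bool timer_wait(std::chrono::microseconds d) {
    assert(d.count() >= 0);
#ifdef _WIN32
    HANDLE timer = CreateWaitableTimerW(nullptr, TRUE, nullptr);  // manual-reset, unnamed
    if (timer == nullptr) return false;
    LARGE_INTEGER due;
    // negative due time = relative wait, expressed in 100-nanosecond units
    due.QuadPart = -static_cast<LONGLONG>(d.count()) * 10;
    const bool ok = SetWaitableTimer(timer, &due, 0, nullptr, nullptr, FALSE) != 0 &&
                    WaitForSingleObject(timer, INFINITE) == WAIT_OBJECT_0;
    CloseHandle(timer);
    return ok;
#else
    std::this_thread::sleep_for(d);  // fallback so the sketch builds elsewhere
    return true;
#endif
}
```

Production code would normally create the timer once and re-arm it every frame instead of creating and closing a handle per wait. The accuracy actually achieved still depends on the system timer resolution.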

Use a timer.
Some OSes can provide special functions. For example, on Windows you can use SetTimer and handle its WM_TIMER messages.
Then calculate the frequency of the timer. 100 fps means that the timer must fire an event every 0.01 seconds.
In the event handler for this timer event you can do your rendering.
In case the rendering is slower than the desired frequency, use a synchronization flag (for example an OpenGL sync object) and discard the timer event if the previous rendering is not complete.

You may set a const fps variable to your desired frame rate, then you can update your game if the elapsed time since the last update is equal to or more than 1 / desired_fps.
This will probably work.
Example:
const /*or constexpr*/ int fps{60};
// then at update loop.
while(running)
{
    // update the game timer.
    timer->update();
    // check for any events.
    // 1.0 forces floating-point division; the integer expression 1 / fps would be 0.
    if(timer->ElapsedTime() >= 1.0 / fps)
    {
        // do your updates and THEN renderer.
    }
}

Why does Sleep() slow down subsequent code for 40ms?

I originally asked about this at coderanch.com, so if you've tried to assist me there, thanks, and don't feel obliged to repeat the effort. coderanch.com is mostly a Java community, though, and this appears (after some research) to really be a Windows question, so my colleagues there and I thought this might be a more appropriate place to look for help.
I have written a short program that either spins on the Windows performance counter until 33ms have passed, or else calls Sleep(33). The former exhibits no unexpected effects, but the latter appears to (inconsistently) slow subsequent processing for about 40ms (either that, or it has some effect on the values returned from the performance counter for that long). After the spin or Sleep(), the program calls a routine, runInPlace(), that spins for 2ms, counting the number of times it queries the performance counter, and returning that number.
When the initial 33ms delay is done by spinning, the number of iterations of runInPlace() tends to be (on my Windows 10, XPS-8700) about 250,000. It varies, probably due to other system overhead, but it varies smoothly around 250,000.
Now, when the initial delay is done by calling Sleep(), something strange happens. A lot of the calls to runInPlace() return a number near 250,000, but quite a few of them return a number near 50,000. Again, the range varies around 50,000, fairly smoothly. But, it is clearly averaging one or the other, with nearly no returns anywhere between 80,000 and 150,000. If I call runInPlace() 100 times after each delay, instead of just once, it never returns a number of iterations in the smaller range after the 20th call. As runInPlace() runs for 2ms, this means the behavior I'm observing disappears after 40ms. If I have runInPlace() run for 4ms instead of 2ms, it never returns a number of iterations in the smaller range after the 10th call, so, again, the behavior disappears after 40ms (likewise if I have runInPlace() run for only 1ms; the behavior disappears after the 40th call).
Here's my code:
#include "stdafx.h"
#include "Windows.h"
int runInPlace(int msDelay)
{
    LARGE_INTEGER t0, t1;
    int n = 0;
    QueryPerformanceCounter(&t0);
    do
    {
        QueryPerformanceCounter(&t1);
        n++;
    } while (t1.QuadPart - t0.QuadPart < msDelay);
    return n;
}
int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER t0, t1;
    LARGE_INTEGER frequency;
    int n;
    QueryPerformanceFrequency(&frequency);
    int msDelay = 2 * frequency.QuadPart / 1000;
    int spinDelay = 33 * frequency.QuadPart / 1000;
    for (int i = 0; i < 100; i++)
    {
        if (argc > 1)
            Sleep(33);
        else
        {
            QueryPerformanceCounter(&t0);
            do
            {
                QueryPerformanceCounter(&t1);
            } while (t1.QuadPart - t0.QuadPart < spinDelay);
        }
        n = runInPlace(msDelay);
        printf("%d \n", n);
    }
    getchar();
    return 0;
}
Here's some output typical of what I get when using Sleep() for the delay:
56116
248936
53659
34311
233488
54921
47904
45765
31454
55633
55870
55607
32363
219810
211400
216358
274039
244635
152282
151779
43057
37442
251658
53813
56237
259858
252275
251099
And here's some output typical of what I get when I spin to create the delay:
276461
280869
276215
280850
188066
280666
281139
280904
277886
279250
244671
240599
279697
280844
159246
271938
263632
260892
238902
255570
265652
274005
273604
150640
279153
281146
280845
248277
Can anyone help me understand this behavior? (Note, I have tried this program, compiled with Visual C++ 2010 Express, on five computers. It only shows this behavior on the two fastest machines I have.)

This sounds like it is due to the reduced clock speed that the CPU will run at when the computer is not busy (SpeedStep). When the computer is idle (like in a sleep) the clock speed will drop to reduce power consumption. On newer CPUs this can be 35% or less of the listed clock speed. Once the computer gets busy again there is a small delay before the CPU will speed up again.
You can turn off this feature either in the BIOS or by changing the "Minimum processor state" setting under "Processor power management" in the advanced settings of your power plan to 100%.

Besides what @1201ProgramAlarm said (which may very well be, modern processors are extremely fond of downclocking whenever they can), it may also be a cache warming up problem.
When you ask to sleep for a while the scheduler typically schedules another thread/process for the next CPU time quantum, which means that the caches (instruction cache, data cache, TLB, branch predictor data, ...) relative to your process are going to be "cold" again when your code regains the CPU.

Best option to profile CPU use in my program?

I am profiling CPU usage on a simple program I am writing. I have different algorithms I want to try, and I also want to know what's the impact on the total system performance.
Currently, I am using ualarm() to execute some instructions at 30Hz; every 15 of those interruptions (every 0.5s) I record the CPU time with getrusage() (in microseconds), so I have an estimate of the total CPU time my program has consumed up to that point. But to get context, I also need to know the total time elapsed in the system in that period, so I can compute the percentage that is used by my program.
/* Main Loop */
while(1)
{
    alarm = 0;
    /* Waiting Loop: */
    for(i=0; !alarm; i++){
    }
    count++;
    /* Do my things */
    /* Check if it's time to store cpu log: */
    if ((count%count_max) == 0)
    {
        getrusage(RUSAGE_SELF, &ru);
        store_cpulog(f,
            (int64_t) ru.ru_utime.tv_sec,
            (int64_t) ru.ru_utime.tv_usec,
            (int64_t) ru.ru_stime.tv_sec,
            (int64_t) ru.ru_stime.tv_usec);
    }
}
I have different options, but I don't know which one will provide the most exact result:
Use ualarm for the timing. Currently it's programmed to signal every 0.5 seconds, so I can take those 0.5 seconds as the elapsed time. It seems quite obvious to use, but is it the best option?
Use clock_gettime(CLOCK_MONOTONIC): it provides readings with a nanosec resolution.
Use gettimeofday(): provides readings with a usec resolution. I've found opinions against using it.
Any recommendation? Thanks.

A possible solution is to use the system's time command and not use a busy loop in your program (as @Hasturkun says). Run in the console:
time /path/to/my/program
and after execution of it you get something like:
real 0m1.465s
user 0m0.000s
sys 0m1.210s
Not sure about precision, if it is enough for you.
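The same real versus user/sys split can also be sampled from inside the program, which is closer to what the question asks for. The sketch below is one possible approach under assumptions: it is POSIX-only because it relies on getrusage, the names cpu_seconds and CpuUsageSampler are hypothetical, and std::chrono::steady_clock stands in for a monotonic wall clock such as CLOCK_MONOTONIC.

```cpp
#include <cassert>
#include <chrono>
#include <sys/resource.h>  // getrusage (POSIX)

// CPU time (user + system) consumed by this process so far, in seconds.
inline double cpu_seconds() {
    struct rusage ru{};
    const int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;  // silence the unused warning when asserts are disabled
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
           ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

// Compares CPU time against a monotonic wall clock between samples.
class CpuUsageSampler {
public:
    CpuUsageSampler() : wall_(std::chrono::steady_clock::now()), cpu_(cpu_seconds()) {}

    // CPU time used since the previous call (or construction), divided by
    // the wall time that passed: 1.0 means one core was fully busy.
    double sample() {
        const auto now = std::chrono::steady_clock::now();
        const double cpu = cpu_seconds();
        const double wall_s = std::chrono::duration<double>(now - wall_).count();
        const double used = cpu - cpu_;
        wall_ = now;
        cpu_ = cpu;
        return wall_s > 0.0 ? used / wall_s : 0.0;
    }

private:
    std::chrono::steady_clock::time_point wall_;
    double cpu_;
};
```

Calling sample() every 0.5 s from the existing logging branch would give the percentage directly, without relying on the alarm period being exact.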

Callgrind is possibly the best application for profiling C/C++ code under Linux. Use it with pride:)

Can I retrieve microseconds or very accurate milliseconds in C++ on Windows?

So I made a game loop that uses the SDL_Delay function to cap the frames per second. It looks like this:
//While the user hasn't quit
while( stateID != STATE_EXIT )
{
    //Start the frame timer
    fps.start();
    //Do state event handling
    currentState->handle_events();
    //Do state logic
    currentState->logic();
    //Change state if needed
    change_state();
    //Do state rendering
    currentState->render();
    //Update the screen
    if( SDL_Flip( screen ) == -1 )
    {
        return 1;
    }
    //Cap the frame rate
    if( fps.get_ticks() < 1000 / FRAMES_PER_SECOND )
    {
        SDL_Delay( ( 1000 / FRAMES_PER_SECOND ) - fps.get_ticks() );
    }
}
So when I run my game at 60 frames per second (which is the "eye cap", I assume), I can still see laggy motion, meaning I see the frames appearing independently, causing unsmooth motion.
This is apparently because the SDL_Delay function is not very accurate, causing a difference of roughly ±15 milliseconds between frames compared with what I want it to be.
(All of these are just my assumptions.)
So I am just searching for a good and accurate timer that will help me with this problem.
Any suggestions?

I think there is a similar question in How to make thread sleep less than a millisecond on Windows
But as a game programmer myself, I don't rely on sleep functions to manage frame-rate (the parameter they take is just a minimum). I just draw stuff on screen as fast as I can. I have a bunch of function calls in my game loop, and then I keep track of how often I'm calling them. For instance, I check input quite often (1000x/second) to make the game more responsive, but I don't check the network inbox more than 100x/second.
For example:
#define NW_CHECK_INTERVAL 10
#define INPUT_CHECK_INTERVAL 1
uint32_t last_nw_check = 0, last_input_check = 0;
while (game_running) {
    uint32_t now = SDL_GetTicks();
    if (now - last_nw_check > NW_CHECK_INTERVAL) {
        check_network();
        last_nw_check = now;
    }
    if (now - last_input_check > INPUT_CHECK_INTERVAL) {
        check_input();
        last_input_check = now;
    }
    check_video();
    // and so on...
}

Use the QueryPerformanceCounter / Frequency for that.
LARGE_INTEGER start, end, tps; //tps = ticks per second
QueryPerformanceFrequency( &tps );
QueryPerformanceCounter( &start );
QueryPerformanceCounter( &end );
int usPassed = (end.QuadPart - start.QuadPart) * 1000000 / tps.QuadPart;

Here's a small wait function I had created for timing midi sequences using QueryPerformanceCounter:
void wait(int waitTime) {
    LARGE_INTEGER time1, time2, freq;
    if(waitTime == 0)
        return;
    QueryPerformanceCounter(&time1);
    QueryPerformanceFrequency(&freq);
    do {
        QueryPerformanceCounter(&time2);
    } while((time2.QuadPart - time1.QuadPart) * 1000000ll / freq.QuadPart < waitTime);
}
To convert ticks to microseconds, calculate the difference in ticks, multiply by 1,000,000 (microseconds/second) and divide by the frequency of ticks per second.
Note that some things may throw this off, for instance the precision of the high-resolution counter is not likely to be down to a single microsecond. For example, if you want to wait 10 microseconds and the precision/frequency is one tick every 6 microseconds, your 10 microsecond wait will actually be no less than 12 microseconds. Again, this frequency is system dependent and will vary from system to system.
Also, Windows is not a real-time operating system. A process may be preempted at any time and it is up to Windows to decide when the process is rescheduled. The application may be preempted in the middle of this function and not restarted again until long after the expected wait time has elapsed. There really isn't much you can do about it but you'll probably never notice it if it happens.
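The tick-to-microsecond conversion and the granularity effect described above can be checked with plain integer arithmetic. The sketch below is an illustration only. Both function names are hypothetical, and the tick frequency is passed in as a parameter rather than queried from the system.

```cpp
#include <cassert>
#include <cstdint>

// Converts a tick difference to whole microseconds (truncating).
// Splitting off whole seconds first keeps the intermediate product small,
// so long intervals do not overflow 64 bits.
inline std::int64_t ticks_to_microseconds(std::int64_t ticks,
                                          std::int64_t ticks_per_second) {
    assert(ticks >= 0 && ticks_per_second > 0);
    const std::int64_t whole_seconds = ticks / ticks_per_second;
    const std::int64_t remainder = ticks % ticks_per_second;
    return whole_seconds * 1000000 + remainder * 1000000 / ticks_per_second;
}

// Smallest number of whole ticks that covers `microseconds` (rounds up).
// Intended for short waits; very large inputs could overflow the product.
inline std::int64_t microseconds_to_ticks_ceil(std::int64_t microseconds,
                                               std::int64_t ticks_per_second) {
    assert(microseconds >= 0 && ticks_per_second > 0);
    return (microseconds * ticks_per_second + 999999) / 1000000;
}
```

With a counter that ticks about once every 6 microseconds (roughly 166667 ticks per second), microseconds_to_ticks_ceil(10, 166667) is 2, which corresponds to the 12 microsecond minimum wait mentioned in the answer.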

60 frames per second is just the frequency of mains power in the US (50 in Europe; Africa and Asia are somewhat mixed), and it is the video refresh frequency for reasons of hardware convenience (it can be an integer multiple on more sophisticated monitors). It was a mandatory constraint for CRT displays, and it is still a convenient reference for LCDs (that's how frequently the frame buffer is uploaded to the display).
The eye-cap is no more than 20-25 fps - not to be confused with retinal persistence, that's about one-half - and that's why TV interlaces two squares upon every refresh.
Independently of the timing accuracy, no hardware device can be updated during its buffer scan (otherwise the image changes while it is shown, resulting in half-drawn broken frames); hence, if you go faster than one half of the device refresh rate, you are queued behind it and forced to wait for it.
60 fps in a game loop serves only to help CPU manufacturers to sell new faster CPUs. Slow down under 25 and everything will look more fluid.

SDL_Delay:
This function waits a specified number of milliseconds before returning. It waits at least the specified time, but possibly longer due to OS scheduling. The delay granularity is at least 10 ms. Some platforms have shorter clock ticks but this is the most common.
The actual delays observed with this function depend on OS settings. I'd suggest looking into the Multimedia Timer API, particularly the timeBeginPeriod function, to adapt the interrupt frequency to your requirements.
Obtaining and Setting Timer Resolution shows an example of how to change the interrupt period to about 1ms. This way you don't have the 15ms hiccup anymore. BTW: Eye-catch period is about 40ms.
Obtaining fixed-period timing can also be addressed by Waitable Timer Objects. But the use of multimedia timers is mandatory to obtain decent resolution, no matter what.
Using other tools to improve the timing capabilities is discussed here.

Limit iterations per time unit

Is there a way to limit iterations per time unit? For example, I have a loop like this:
for (int i = 0; i < 100000; i++)
{
// do stuff
}
I want to limit the loop above so there will be maximum of 30 iterations per second.
I would also like the iterations to be evenly positioned in the timeline so not something like 30 iterations in first 0.4s and then wait 0.6s.
Is that possible? It does not have to be completely precise (though the more precise it will be the better).

@FredOverflow My program is running very fast. It is sending data over wifi to another program which is not fast enough to handle them at the current rate. – Richard Knop
Then you should probably have the program you're sending data to send an acknowledgment when it's finished receiving the last chunk of data you sent then send the next chunk. Anything else will just cause you frustrations down the line as circumstances change.

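Suppose you have a good Now() function (GetTickCount() is a bad example: it's OS-specific and has poor precision):
for (int i = 0; i < 1000; i++) {
    DWORD have_to_sleep_until = GetTickCount() + EXPECTED_ITERATION_TIME_MS;
    // do stuff
    // signed difference: an unsigned DWORD result would wrap to a huge value if we are already late
    int remaining = (int)(have_to_sleep_until - GetTickCount());
    if (remaining > 0)
        Sleep(remaining);
}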

You can check the elapsed time inside the loop, but that may not be the usual solution. Because computation time depends entirely on the performance of the machine and the algorithm, people optimize it during development (e.g. many game programmers require at least 25-30 frames per second for properly smooth animation).

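The easiest way (for Windows) is to use QueryPerformanceCounter(). Some code below.
#include <windows.h>
LARGE_INTEGER freq, c1, c2, c3, c4;
QueryPerformanceFrequency(&freq);
// time per iteration in milliseconds for 30 iterations / sec (same unit as the 1e3 factor below)
const double timeWanted = 1000.0 / 30.0;
for (int i = 0; i < 100000; i++) {
    QueryPerformanceCounter(&c1);
    // do stuff
    QueryPerformanceCounter(&c2);
    double timeElapsed = (double)(c2.QuadPart - c1.QuadPart) * 1e3 / (double)freq.QuadPart; // time in milliseconds
    double timeDiff = timeWanted - timeElapsed;
    if (timeDiff > 0) {
        // busy-wait for the rest of this iteration's time slot
        QueryPerformanceCounter(&c3);
        QueryPerformanceCounter(&c4);
        while ((double)(c4.QuadPart - c3.QuadPart) * 1e3 / (double)freq.QuadPart < timeDiff)
            QueryPerformanceCounter(&c4);
    }
}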
EDIT: You must make sure that the 'do stuff' area takes less time than your framerate or else it doesn't matter. Also instead of 1e3 for milliseconds, you can go all the way to nanoseconds if you do 1e9 (if you want that much accuracy)
WARNING... this will eat your CPU but give you good 'software' timing... Do it in a separate thread (and only if you have more than 1 processor) so that any GUIs won't lock. You can put a conditional in there to stop the loop if this is a multi-threaded app too.

@FredOverflow My program is running very fast. It is sending data over wifi to another program which is not fast enough to handle them at the current rate. – Richard Knop
What you might need is a buffer or queue on the receiver side. The thread that receives the messages from the client (e.g. through a socket) gets each message and puts it in the queue. The actual consumer of the messages reads/pops from the queue. Of course, you need concurrency control for your queue.
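A minimal sketch of such a queue with concurrency control, using a mutex and two condition variables. The class name BoundedQueue is hypothetical, and the fixed capacity is an added assumption: it makes a too-fast producer block instead of letting the queue grow without limit.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Called by the receiving thread. Blocks while the queue is full.
    void push(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(value));
        lock.unlock();
        not_empty_.notify_one();
    }

    // Called by the consumer. Blocks until a message is available.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        lock.unlock();
        not_full_.notify_one();
        return value;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::condition_variable not_empty_;
    std::condition_variable not_full_;
    std::queue<T> items_;
};
```

The socket thread would call push for every received message while the slower consumer calls pop in its own loop. When the queue is full the socket thread stops reading, which over TCP eventually slows the sender down as well.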

Besides the flow control methods mentioned, you may also need to maintain an accurate, specific data sending rate on the sender side. Usually it can be done like this.
E.g. if you want to send at 10Mbps, create a timer with an interval of 1ms so it will call a predefined function every 1ms. Then, in the timer handler function, by keeping track of 2 static variables, 1) the time elapsed since the beginning of sending data and 2) how much data in bytes has been sent up to the last call, you can easily calculate how much data needs to be sent in the current call (or just sleep and wait for the next call).
This way, you can do "streaming" of data in a very stable way with very little jitter, and this is usually adopted in streaming of videos. Of course it also depends on how accurate the timer is.
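The per-call calculation described above can be sketched as a single function. The name bytes_due and its parameters are hypothetical, and the timer and the actual send call are left out.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Bytes that should go out in the current timer call so that the running
// total tracks `rate_bits_per_second` over the time elapsed since sending began.
inline std::uint64_t bytes_due(std::chrono::microseconds elapsed,
                               std::uint64_t rate_bits_per_second,
                               std::uint64_t bytes_already_sent) {
    assert(elapsed.count() >= 0);
    const std::uint64_t bytes_per_second = rate_bits_per_second / 8;
    // total number of bytes that should have been sent by now
    const std::uint64_t target =
        bytes_per_second * static_cast<std::uint64_t>(elapsed.count()) / 1000000;
    // if we are ahead of schedule, send nothing and wait for the next call
    return target > bytes_already_sent ? target - bytes_already_sent : 0;
}
```

At 10 Mbps the first call after 1 ms yields 1250 bytes. Because the amount is derived from the total elapsed time rather than from the nominal timer interval, a timer call that arrives late simply sends a little more, so the average rate stays on target.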
