Why does Sleep() slow down subsequent code for 40ms? - c++

I originally asked about this at coderanch.com, so if you've tried to assist me there, thanks, and don't feel obliged to repeat the effort. coderanch.com is mostly a Java community, though, and this appears (after some research) to really be a Windows question, so my colleagues there and I thought this might be a more appropriate place to look for help.
I have written a short program that either spins on the Windows performance counter until 33ms have passed, or else calls Sleep(33). The former exhibits no unexpected effects, but the latter appears to (inconsistently) slow subsequent processing for about 40ms (either that, or it has some effect on the values returned from the performance counter for that long). After the spin or Sleep(), the program calls a routine, runInPlace(), that spins for 2ms, counting the number of times it queries the performance counter, and returning that number.
When the initial 33ms delay is done by spinning, the number of iterations of runInPlace() tends to be (on my Windows 10, XPS-8700) about 250,000. It varies, probably due to other system overhead, but it varies smoothly around 250,000.
Now, when the initial delay is done by calling Sleep(), something strange happens. A lot of the calls to runInPlace() return a number near 250,000, but quite a few of them return a number near 50,000. Again, the range varies around 50,000, fairly smoothly. But it clearly clusters around one value or the other, with nearly no returns anywhere between 80,000 and 150,000. If I call runInPlace() 100 times after each delay, instead of just once, it never returns a number of iterations in the smaller range after the 20th call. As runInPlace() runs for 2ms, this means the behavior I'm observing disappears after 40ms. If I have runInPlace() run for 4ms instead of 2ms, it never returns a number of iterations in the smaller range after the 10th call, so, again, the behavior disappears after 40ms (likewise if I have runInPlace() run for only 1ms; the behavior disappears after the 40th call).
Here's my code:
#include "stdafx.h"
#include "Windows.h"
// Despite the parameter name, msDelay is passed in performance-counter ticks
// (main() converts milliseconds to ticks before calling). Returns the number
// of times QueryPerformanceCounter was called during the spin.
int runInPlace(int msDelay)
{
    LARGE_INTEGER t0, t1;
    int n = 0;
    QueryPerformanceCounter(&t0);
    do
    {
        QueryPerformanceCounter(&t1);
        n++;
    } while (t1.QuadPart - t0.QuadPart < msDelay);
    return n;
}

int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER t0, t1;
    LARGE_INTEGER frequency;
    int n;
    QueryPerformanceFrequency(&frequency);
    int msDelay = 2 * frequency.QuadPart / 1000;      // 2 ms, in ticks
    int spinDelay = 33 * frequency.QuadPart / 1000;   // 33 ms, in ticks
    for (int i = 0; i < 100; i++)
    {
        if (argc > 1)
            Sleep(33);
        else
        {
            QueryPerformanceCounter(&t0);
            do
            {
                QueryPerformanceCounter(&t1);
            } while (t1.QuadPart - t0.QuadPart < spinDelay);
        }
        n = runInPlace(msDelay);
        printf("%d \n", n);
    }
    getchar();
    return 0;
}
Here's some output typical of what I get when using Sleep() for the delay:
56116
248936
53659
34311
233488
54921
47904
45765
31454
55633
55870
55607
32363
219810
211400
216358
274039
244635
152282
151779
43057
37442
251658
53813
56237
259858
252275
251099
And here's some output typical of what I get when I spin to create the delay:
276461
280869
276215
280850
188066
280666
281139
280904
277886
279250
244671
240599
279697
280844
159246
271938
263632
260892
238902
255570
265652
274005
273604
150640
279153
281146
280845
248277
Can anyone help me understand this behavior? (Note, I have tried this program, compiled with Visual C++ 2010 Express, on five computers. It only shows this behavior on the two fastest machines I have.)

This sounds like it is due to the reduced clock speed that the CPU will run at when the computer is not busy (SpeedStep). When the computer is idle (like in a sleep) the clock speed will drop to reduce power consumption. On newer CPUs this can be 35% or less of the listed clock speed. Once the computer gets busy again there is a small delay before the CPU will speed up again.
You can turn off this feature either in the BIOS, or by changing the "Minimum processor state" setting (under "Processor power management" in the advanced settings of your power plan) to 100%.

Besides what @1201ProgramAlarm said (which may very well be the case; modern processors are extremely fond of downclocking whenever they can), it may also be a cache warm-up problem.
When you ask to sleep for a while, the scheduler typically schedules another thread/process for the next CPU time quantum, which means that the caches (instruction cache, data cache, TLB, branch predictor data, ...) belonging to your process are going to be "cold" again when your code regains the CPU.
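If you want to tell the two effects apart, one quick (and unscientific) experiment is to burn a short, fixed amount of CPU right after the Sleep() and before the measured spin, so the core has a chance to clock back up and the caches get re-warmed. A minimal sketch, assuming it is dropped into the questioner's loop right before the runInPlace() call (warmUpTicks, w0 and w1 are names introduced here, not from the original code):

LARGE_INTEGER w0, w1;
LONGLONG warmUpTicks = frequency.QuadPart / 1000;   // roughly 1 ms worth of ticks
QueryPerformanceCounter(&w0);
do
{
    QueryPerformanceCounter(&w1);
} while (w1.QuadPart - w0.QuadPart < warmUpTicks);
n = runInPlace(msDelay);   // if the ~50,000 readings disappear, warm-up (clock or cache) was the cause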


lowest latency/high resolution, highest timing guarantee, timing/timer on windows? [duplicate]

I am making a program using the Sleep command via Windows.h, and am experiencing a frustrating difference between running my program on Windows 10 instead of Windows 7. I simplified my program to the program below which exhibits the same behavior as my more complicated program.
On Windows 7 this 5000-iteration loop runs with the Sleep function at 1 ms. This takes 5 seconds to complete.
On Windows 10 when I run the exact same program (exact same binary executable file), this program takes almost a minute to complete.
For my application this is completely unacceptable as I need to have the 1ms timing delay in order to interact with hardware I am using.
I also tried a suggestion from another post to use the select() command (via winsock2), but that command did not delay by 1 ms either. I have tried this program on multiple Windows 7 and Windows 10 PCs, and the root cause of the issue always points to Windows 10 rather than Windows 7. The program always runs within ~5 seconds on numerous Windows 7 PCs, while on the multiple Windows 10 PCs that I have tested the duration has been much longer, ~60 seconds.
I have been using Microsoft Visual Studio Express 2010 (C/C++) as well as Microsoft Visual Studio Express 2017 (C/C++) to compile the programs. The version of visual studio does not influence the results.
I have also changed the build configuration from 'Debug' to 'Release' and tried turning on compiler optimizations, but this does not help either.
Any suggestions would be greatly appreciated.
#include <stdio.h>
#include <Windows.h>

#define LOOP_COUNT 5000

int main()
{
    int i = 0;
    for (i; i < LOOP_COUNT; i++) {
        Sleep(1);
    }
    return 0;
}
I need to have the 1ms timing delay in order to interact with hardware I am using
Windows is the wrong tool for this job.
If you insist on using this wrong tool, you are going to have to make compromises (such as using a busy-wait and accepting the corresponding poor battery life).
You can make Sleep() more accurate using timeBeginPeriod(1) but depending on your hardware peripheral's limits on the "one millisecond" delay -- is that a minimum, maximum, or the middle of some range? -- it still will fail to meet your timing requirement with some non-zero probability.
The timeBeginPeriod function requests a minimum resolution for periodic timers.
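For illustration only (this snippet is mine, not part of the answer), a minimal sketch of raising the timer resolution around the questioner's loop; timeBeginPeriod/timeEndPeriod live in winmm.lib and should always be paired:

#include <windows.h>
#include <timeapi.h>
#pragma comment(lib, "winmm.lib")

int main()
{
    timeBeginPeriod(1);               // request 1 ms timer resolution
    for (int i = 0; i < 5000; i++) {
        Sleep(1);                     // now close to 1 ms per call, but still not guaranteed
    }
    timeEndPeriod(1);                 // always restore the previous resolution
    return 0;
}

Even with this, Sleep() only promises a minimum wait, so hard deadlines can still be missed occasionally.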
The right solution for talking to hardware with tight timing tolerances is an embedded microcontroller which talks to the Windows PC through some very flexible interface such as UART or Ethernet, buffers data, and uses hardware timers to generate signals with very well-defined timing.
In some cases, you might be able to use embedded circuitry already existing within your Windows PC, such as "sound card" functionality.
@BenVoigt & @mzimmers, thank you for your responses and suggestions. I did find a unique solution to this question, and the solution was inspired by the post linked directly below.
Units of QueryPerformanceFrequency
In this post BrianP007 writes a function to see how long the Sleep(1000) command takes. However, while I was playing around I realized that Sleep() accepts 0. Therefore I used a similar structure to the linked post to find how many loop iterations it takes to reach a delta t of 1ms.
For my purposes I increased i by 100; however, it can be incremented by 10 or by 1 in order to get a more accurate estimate of what i should be.
Once you get a value for i, you can use that value to get an approximate 1ms delay on your machine. Running this function in a loop (I ran it 100 times), I got anywhere from i = 3000 to i = 6000; my machine averages out around 5500. This spread is probably due to jitter/clock frequency changes over time in the processor.
The processor_check() function below only determines the value to use for the for-loop count; the actual 'timer' just needs the for loop with Sleep(0) inside it to run with ~1ms resolution on the machine.
While this method is not perfect, it is much closer and works a ton better than using Sleep(1). I have to test this more thoroughly, but please let me know if this works for you as well. Please feel free to use the code below if you need it for your own applications. This code can be copied and pasted into an empty command-line C program in Visual Studio directly without modification.
/* ZKR Sleep_ZR() */
#include "stdio.h"
#include <windows.h>

/* Gets the for-loop count */
int processor_check()
{
    double delta_time = 0;
    int i = 0;
    int n = 0;
    while (delta_time < 0.001) {
        LARGE_INTEGER sklick, eklick, cpu_khz;
        QueryPerformanceFrequency(&cpu_khz);
        QueryPerformanceCounter(&sklick);
        for (n = 0; n < i; n++) {
            Sleep(0);
        }
        QueryPerformanceCounter(&eklick);
        delta_time = (eklick.QuadPart - sklick.QuadPart) / (double)cpu_khz.QuadPart;
        i = i + 100;
    }
    return i;
}

/* Timer */
void Sleep_ZR(int cnt)
{
    int i = 0;
    for (i; i < cnt; i++) {
        Sleep(0);
    }
}

/* Main */
int main(int argc, char** argv)
{
    double average = 0;
    int i = 0;
    /* Single use */
    int loop_count = processor_check();
    Sleep_ZR(loop_count);
    /* Average based on processor to get a more accurate Sleep_ZR */
    for (i = 0; i < 100; i++) {
        loop_count = processor_check();
        average = average + loop_count;
    }
    average = average / 100;
    printf("Average: %f\n", average);
    /* 10 second test */
    for (i = 0; i < 10000; i++) {
        Sleep_ZR((int)average);
    }
    return 0;
}

C++ Hot loop makes timing and function accurate but takes 20% CPU

Hey guys, this is my first question on Stack Overflow, so if I've done something wrong, my bad.
I have a program which is designed to make precise mouse movements at specific times, and it calculates the timing using a few hard-coded variables and a timing function, which works in microseconds for accuracy. The program works perfectly as intended and makes the correct movements at the correct timing, etc.
The only problem is that the sleeping function I am using is a hot loop (as in, it's a while loop without a sleep), so when the program is executing the movements, it can take up to 20% CPU usage. The context of this is a game, and the loop can drop the in-game FPS from 60 down to 30 with lots of stuttering, making the game unplayable.
I am still learning C++, so any help is greatly appreciated. Below are some snippets of my code to show what I am trying to explain.
This is where the sleep is called, for some context:
void foo(/* parameters and stuff, not really important */)
{
    // ... a bunch of movement code here, not important ...

    // After doing its first movement of many, it calls this sleep function
    // (Time::Sleep) from an if statement so it knows how long to sleep before
    // executing the next movement.
    if (repeat_delay - animation > 0) Time::Sleep(repeat_delay - animation, excess);
}
Now here is the actual sleeping function which, according to the Visual Studio performance profiler, is using all my resources. All of the parameters in this function are accounted for already; like I said before, the code works perfectly, apart from performance.
#include "Time.hpp"
#include <windows.h>
namespace Time
{
void Sleep(int64_t sleep_ms, std::chrono::time_point<std::chrono::steady_clock> start)
{
sleep_ms *= 1000;
auto truncated = (sleep_ms - std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - start).count()) / 1000;
while (std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - start).count() < sleep_ms)
{
if (truncated)
{
std::this_thread::sleep_for(std::chrono::milliseconds(truncated));
truncated = 0;
}
/*
I have attempted putting even a 1 microsecond sleep in here, which brings CPU usage down to
0.5%
which is great, but my movements slowed right down, and even after attempting to speed up
the movements manually by altering a the movement functions mouse speed variable, it just
makes the movements inaccurate. How can I improve performance here without sacrificing
accuracy
*/
}
}
}
Why did you write a sleep function? Just use std::this_thread::sleep_for as it doesn't use any resources and is reasonably accurate.
Its accuracy might depend on the platform. On my Windows 10 PC it is accurate to within 1 millisecond, which should be suitable for durations over 10ms (= 100fps).
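As an illustration of that suggestion (my sketch, not the answerer's code, and it assumes the same Time::Sleep signature as in the question): sleep_until can cover all but the last millisecond of the interval, and only that final stretch is spun, which keeps CPU usage low without giving up much accuracy.

#include <chrono>
#include <thread>

namespace Time
{
    // Sketch: let the OS sleep for the bulk of the interval, spin only the final ~1 ms.
    void Sleep(int64_t sleep_ms, std::chrono::time_point<std::chrono::steady_clock> start)
    {
        const auto deadline = start + std::chrono::milliseconds(sleep_ms);
        const auto coarse = deadline - std::chrono::milliseconds(1);
        if (std::chrono::steady_clock::now() < coarse)
            std::this_thread::sleep_until(coarse);      // cheap, coarse wait
        while (std::chrono::steady_clock::now() < deadline)
        {
            // short busy-wait for the remainder, for accuracy
        }
    }
}

If the coarse sleep overshoots on some systems, the spin window can be widened (spin the last 2-3 ms instead) at the cost of a little more CPU.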

I'm looking to improve or request my current delay / sleep method. c++

Currently I am coding a project that requires precise delay times across a number of computers. This is the code I am using; I found it on a forum. The code is below.
{
    // ms is the requested delay in milliseconds (the enclosing function's
    // signature was omitted in the original post)
    LONGLONG timerResolution;
    LONGLONG wantedTime;
    LONGLONG currentTime;
    QueryPerformanceFrequency((LARGE_INTEGER*)&timerResolution);
    timerResolution /= 1000;
    QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);
    wantedTime = currentTime / timerResolution + ms;
    currentTime = 0;
    while (currentTime < wantedTime)
    {
        QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);
        currentTime /= timerResolution;
    }
}
Basically, the issue I am having is that this uses a lot of CPU, around 16-20%, when I start to call the function. The usual Sleep() uses zero CPU, but it is extremely inaccurate. From what I have read on multiple forums, that's the trade-off between accuracy and CPU usage, but I thought I had better raise the question before I settle for this sleep method.
The reason it's using 15-20% CPU is likely that it's using 100% of one core, as there is nothing in the loop to slow it down.
In general, this is a "hard" problem to solve as PCs (more specifically, the OSes running on those PCs) are in general not made for running real time applications. If that is absolutely desirable, you should look into real time kernels and OSes.
For this reason, the guarantee that is usually made around sleep times is that the system will sleep for at least the specified amount of time.
If you are running Linux you could try using the nanosleep function (http://man7.org/linux/man-pages/man2/nanosleep.2.html), though I don't have any experience with it.
Alternatively you could go with a hybrid approach where you use sleeps for long delays, but switch to polling when it's almost time:
#include <thread>
#include <chrono>
using namespace std::chrono_literals;
...
wantedTime = currentTime / timerResolution + ms;
currentTime = 0;
while (currentTime < wantedTime)
{
    QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);
    currentTime /= timerResolution;
    if (wantedTime - currentTime > 100) // if more than 100 ms of waiting remain
    {
        // Sleep for a value significantly lower than the 100 ms, to ensure that we don't "oversleep"
        std::this_thread::sleep_for(50ms);
    }
}
Now this is a bit race condition prone, as it assumes that the OS will hand back control of the program within 50ms after the sleep_for is done. To further combat this you could turn it down (to say, sleep 1ms).
You can set the Windows timer resolution to its minimum (usually 1 ms) to make Sleep() accurate to within about 1 ms. By default it is only accurate to within about 15 ms. See the Sleep() documentation.
Note that your execution can be delayed if other programs are consuming CPU time, but this could also happen if you were waiting with a timer.
#include <timeapi.h>   // link with winmm.lib

// Sleep() takes 15 ms (or whatever the default is)
Sleep(1);

TIMECAPS caps_;
timeGetDevCaps(&caps_, sizeof(caps_));
timeBeginPeriod(caps_.wPeriodMin);

// Sleep() now takes 1 ms
Sleep(1);

timeEndPeriod(caps_.wPeriodMin);

WinAPI calling code at specific timestamp

Using functions available in the WinAPI, is it possible to ensure that a specific function is called according to a millisecond-precise timestamp? And if so, what would be the correct implementation?
I'm trying to write tool assisted speedrun software. This type of software sends user input commands at very exact moments after the script is launched to perform humanly impossible inputs that allow faster completion of videogames. A typical sequence looks something like this:
At 0 milliseconds send right key down event
At 5450 milliseconds send right key up, and up key down event
At 5460 milliseconds send left key down event
etc..
What I've tried so far is listed below. As I'm not experienced in the low-level nuances of high-precision timers, I have some results but no understanding of why they are this way:
Using Sleep in combination with timeBeginPeriod set to 1 between inputs gave the worst results. Out of 20 executions, 0 met the timing requirement. I believe this is well explained in the documentation for Sleep: "Note that a ready thread is not guaranteed to run immediately. Consequently, the thread may not run until some time after the sleep interval elapses." My understanding is that Sleep isn't up to this task.
Using a busy-wait loop checking GetTickCount64 with timeBeginPeriod set to 1 produced slightly better results. Out of 20 executions, 2 met the timing requirement, but apparently that was just a fortunate circumstance. I've looked up some info on this timing function, and my suspicion is that it doesn't update often enough to allow 1 millisecond accuracy.
Replacing GetTickCount64 with QueryPerformanceCounter improved the situation slightly. Out of 20 executions, 8 succeeded. I wrote a logger that would store the QPC timestamps right before each input is sent and dump the values to a file after the sequence is finished. I even went as far as to preallocate space for all variables in my code to make sure that time isn't wasted on needless explicit memory allocations. The log values diverge from the timestamps I supply the program by anything from 1 to 40 milliseconds. General-purpose programming can live with that, but in my case a single frame of the game is 16.7 ms, so in the worst possible case with delays like these I can be 3 frames late, which defeats the purpose of the whole experiment.
Setting the process priority to high didn't make any difference.
At this point I'm not sure where to look next. My two guesses are that maybe the time it takes to iterate the busy loop and check the time using (QPCNow - QPCStart) / QPF is itself somehow long enough to introduce the mentioned delay, or that the process is interrupted by the OS scheduler somewhere along the execution of the loop and control returns too late.
The game is 100% deterministic and locked at 60 fps. I am convinced that if I manage to make the input be timed accurately the result will always be 20 out of 20, but at this point I'm beginning to suspect that this may not be possible.
EDIT: As per request, here is a stripped-down testing version. Set a breakpoint after the second call to ExecuteAtTime and view the TimeBeforeInput variables. For me they read 1029 and 6017 (I've omitted the decimals), meaning that the code executed 29 and 17 milliseconds later than it should have.
Disclaimer: the code is not written to demonstrate good programming practices.
#include "stdafx.h"
#include <windows.h>
__int64 g_TimeStart = 0;
double g_Frequency = 0.0;
double g_TimeBeforeFirstInput = 0.0;
double g_TimeBeforeSecondInput = 0.0;
double GetMSSinceStart(double& debugOutput)
{
LARGE_INTEGER now;
QueryPerformanceCounter(&now);
debugOutput = double(now.QuadPart - g_TimeStart) / g_Frequency;
return debugOutput;
}
void ExecuteAtTime(double ms, INPUT* keys, double& debugOutput)
{
while(GetMSSinceStart(debugOutput) < ms)
{
}
SendInput(2, keys, sizeof(INPUT));
}
INPUT* InitKeys()
{
INPUT* result = new INPUT[2];
ZeroMemory(result, 2*sizeof(INPUT));
INPUT winKey;
winKey.type = INPUT_KEYBOARD;
winKey.ki.wScan = 0;
winKey.ki.time = 0;
winKey.ki.dwExtraInfo = 0;
winKey.ki.wVk = VK_LWIN;
winKey.ki.dwFlags = 0;
result[0] = winKey;
winKey.ki.dwFlags = KEYEVENTF_KEYUP;
result[1] = winKey;
return result;
}
int _tmain(int argc, _TCHAR* argv[])
{
INPUT* keys = InitKeys();
LARGE_INTEGER qpf;
QueryPerformanceFrequency(&qpf);
g_Frequency = double(qpf.QuadPart) / 1000.0;
LARGE_INTEGER qpcStart;
QueryPerformanceCounter(&qpcStart);
g_TimeStart = qpcStart.QuadPart;
//Opens windows start panel one second after launch
ExecuteAtTime(1000.0, keys, g_TimeBeforeFirstInput);
//Closes windows start panel 5 seconds later
ExecuteAtTime(6000.0, keys, g_TimeBeforeSecondInput);
delete[] keys;
Sleep(1000);
return 0;
}

Busy Loop/Spinning sometimes takes too long under Windows

I'm using a Windows 7 PC to output voltages at a rate of 1 kHz. At first I simply ended the thread with sleep_until(nextStartTime); however, this has proven to be unreliable, sometimes working fine and sometimes being off by up to 10 ms.
I found other answers here saying that a busy loop might be more accurate; however, mine for some reason also sometimes takes too long.
while (true) {
    doStuff(); // is quick enough
    logDelays();
    nextStartTime = chrono::high_resolution_clock::now() + chrono::milliseconds(1);
    spinStart = chrono::high_resolution_clock::now();
    while (chrono::duration_cast<chrono::microseconds>(nextStartTime -
           chrono::high_resolution_clock::now()).count() > 200) {
        spinCount++; // a volatile int
    }
    int spintime = chrono::duration_cast<chrono::microseconds>
        (chrono::high_resolution_clock::now() - spinStart).count();
    cout << "Spin Time micros :" << spintime << endl;
    if (spinCount > 100000000) {
        cout << "reset spincount" << endl;
        spinCount = 0;
    }
}
I was hoping that this would fix my issue; however, it produces the output:
Spin Time micros :9999
Spin Time micros :9999
...
I've been stuck on this problem for the last 5 hours and I'd be very thankful if somebody knows a solution.
According to the comments this code waits correctly:
auto start = std::chrono::high_resolution_clock::now();
const auto delay = std::chrono::milliseconds(1);
while (true) {
    doStuff(); // is quick enough
    logDelays();
    auto spinStart = std::chrono::high_resolution_clock::now();
    // busy-wait until one full delay has elapsed since 'start'
    while (std::chrono::high_resolution_clock::now() < start + delay) {}
    int spintime = std::chrono::duration_cast<std::chrono::microseconds>
        (std::chrono::high_resolution_clock::now() - spinStart).count();
    std::cout << "Spin Time micros :" << spintime << std::endl;
    start += delay;
}
The important parts are the busy-wait while (std::chrono::high_resolution_clock::now() < start + delay) {} and start += delay;, which in combination make sure that one delay worth of time is waited per iteration, even when outside factors (Windows Update keeping the system busy) disturb it. In case the loop takes longer than delay, it will be executed without waiting until it catches up (which may be never if doStuff is sufficiently slow).
Note that missing an update (due to the system being busy) and then sending 2 at once to catch up might not be the best way to handle the situation. You may want to check the current time inside doStuff and abort/restart the transmission if the timing is wrong by more than some acceptable amount.
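A rough sketch of that check (my illustration, not part of the answer; it reuses start and delay from the code above, and maxLateness plus the decision to simply resynchronise are assumptions):

// Inside the loop, after the busy-wait: detect that we are running behind schedule.
const auto maxLateness = std::chrono::milliseconds(5);      // assumed tolerance
const auto now = std::chrono::high_resolution_clock::now();
if (now - start > delay + maxLateness) {
    // One or more 1 ms slots were missed; resynchronise instead of bursting
    // the missed outputs back to back (aborting/restarting the transmission
    // is the alternative, depending on what the hardware expects).
    start = now;
}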
On Windows I don't think it's possible to ever get such precise timing, because you cannot guarantee your thread is actually running at the time you desire. Even with low CPU usage and setting your thread to real-time priority, it can still be interrupted (by hardware interrupts, as I understand it; I never fully investigated, but I've seen even a simple while(true) ++i; type loop at real-time priority get interrupted and then moved between CPU cores). While such interrupts and switching are very quick for a real-time thread, they are still significant if you're trying to directly drive a signal without buffering.
Instead you really want to read and write buffers of digital samples (so at 1 kHz each sample is 1 ms). You need to be sure to queue another buffer before the last one is completed, which constrains how small the buffers can be, but at 1 kHz at real-time priority, if the code is simple and there is no other CPU contention, a single-sample buffer (1 ms) might even be possible, which is at worst 1 ms of extra latency over "immediate"; you would have to test. You then leave it up to the hardware and its drivers to handle the precise timing (e.g. to make sure each output sample is "exactly" 1 ms, to the accuracy the vendor claims).
This basically means your code only has to be accurate to 1 ms in the worst case, rather than trying to pursue something far smaller than the OS really supports, such as microsecond accuracy.
As long as you are able to queue a new buffer before the hardware has used up the previous one, it will be able to run at the desired frequency without issue (to use audio as an example again: while the tolerated latencies, and thus the buffers, are often much larger, if you overload the CPU you can still sometimes hear audible glitches where an application didn't queue up new raw audio in time).
With careful timing you might even be able to get down to a fraction of a millisecond by waiting as long as possible before processing and queuing your next sample (e.g. if you need to reduce latency between input and output), but remember that the closer you cut it, the more you risk submitting it too late.