C++ Hot loop makes timing and function accurate but takes 20% CPU

C++ Hot loop makes timing and function accurate but takes 20% CPU - c++

Hey guys so this is my first question of Stack Overflow so if I've done something wrong my bad.
I have my program which is designed to make precise mouse movements at specific times, and it calculates the timing using a few hard coded variables and a timing function, which is running in microseconds for accuracy. The program works perfectly as intended, and makes the correct movements at the correct timing etc.
Only problem is, that the sleeping function I am using is a hot loop (as in, its a while loop without a sleep), so when the program is executing the movements, it can take up to 20% CPU usage. The context of this is in a game, and can drop FPS in game from 60 down to 30 with lots of stuttering, making the game unplayable.
I am still learning c++, so any help is greatly appreciated. Below is some snippets of my code to show what I am trying to explain.
this is where the sleep is called for some context
void foo(paramenters and stuff not rly important)
{
Code is doing a bunch of movement stuff here not important blah blah
//after doing its first movement of many, it calls this sleep function from if statement (Time::Sleep) so it knows how long to sleep before executing the next movement.
if (repeat_delay - animation > 0) Time::Sleep(repeat_delay - animation, excess);
}
Now here is the actual sleeping function, which, after using the visual studio performance debugger I can see is using all my resources. All of the parameters in this function are accounted for already, like I said before, the code works perfectly, apart from performance.
#include "Time.hpp"
#include <windows.h>
namespace Time
{
void Sleep(int64_t sleep_ms, std::chrono::time_point<std::chrono::steady_clock> start)
{
sleep_ms *= 1000;
auto truncated = (sleep_ms - std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - start).count()) / 1000;
while (std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - start).count() < sleep_ms)
{
if (truncated)
{
std::this_thread::sleep_for(std::chrono::milliseconds(truncated));
truncated = 0;
}
/*
I have attempted putting even a 1 microsecond sleep in here, which brings CPU usage down to
0.5%
which is great, but my movements slowed right down, and even after attempting to speed up
the movements manually by altering a the movement functions mouse speed variable, it just
makes the movements inaccurate. How can I improve performance here without sacrificing
accuracy
*/
}
}
}

Why did you write a sleep function? Just use std::this_thread::sleep_for as it doesn't use any resources and is reasonably accurate.
Its' accuracy might depend on platform. On my Windows 10 PC it is accurate within 1 millisecond which should be suitable for durations over 10ms (= 100fps).

Related

Busy Loop/Spinning sometimes takes too long under Windows

I'm using a windows 7 PC to output voltages at a rate of 1kHz. At first I simply ended the thread with sleep_until(nextStartTime), however this has proven to be unreliable, sometimes working fine and sometimes being of by up to 10ms.
I found other answers here saying that a busy loop might be more accurate, however mine for some reason also sometimes takes too long.
while (true) {
doStuff(); //is quick enough
logDelays();
nextStartTime = chrono::high_resolution_clock::now() + chrono::milliseconds(1);
spinStart = chrono::high_resolution_clock::now();
while (chrono::duration_cast<chrono::microseconds>(nextStartTime -
chrono::high_resolution_clock::now()).count() > 200) {
spinCount++; //a volatile int
}
int spintime = chrono::duration_cast<chrono::microseconds>
(chrono::high_resolution_clock::now() - spinStart).count();
cout << "Spin Time micros :" << spintime << endl;
if (spinCount > 100000000) {
cout << "reset spincount" << endl;
spinCount = 0;
}
}
I was hoping that this would work to fix my issue, however it produces the output:
Spin Time micros :9999
Spin Time micros :9999
...
I've been stuck on this problem for the last 5 hours and I'd very thankful if somebody knows a solution.

According to the comments this code waits correctly:
auto start = std::chrono::high_resolution_clock::now();
const auto delay = std::chrono::milliseconds(1);
while (true) {
doStuff(); //is quick enough
logDelays();
auto spinStart = std::chrono::high_resolution_clock::now();
while (start > std::chrono::high_resolution_clock::now() + delay) {}
int spintime = std::chrono::duration_cast<std::chrono::microseconds>
(std::chrono::high_resolution_clock::now() - spinStart).count();
std::cout << "Spin Time micros :" << spintime << std::endl;
start += delay;
}
The important part is the busy-wait while (start > std::chrono::high_resolution_clock::now() + delay) {} and start += delay; which will in combination make sure that delay amount of time is waited, even when outside factors (windows update keeping the system busy) disturb it. In case that the loop takes longer than delay the loop will be executed without waiting until it catches up (which may be never if doStuff is sufficiently slow).
Note that missing an update (due to the system being busy) and then sending 2 at once to catch up might not be the best way to handle the situation. You may want to check the current time inside doStuff and abort/restart the transmission if the timing is wrong by more then some acceptable amount.

On Windows I dont think its possible to ever get such precise timing, because you can not garuntee your thread is actually running at the time you desire. Even with low CPU usage and setting your thread to real time priority, it can still be interuptted (Hardware interupts as I understand. Never fully investigate but even a simple while(true) ++i; type loop at realtime Ive seen get interupted then moved between CPU cores). While such interrupts and switching for a realtime thread is very quick, its still significant if your trying to directly drive a signal without buffering.
Instead you really want to read and write buffers of digital samples (so at 1KHz each sample is 1ms). You need to be sure to queue another buffer before the last one is completed, which will constrain how small they can be, but at 1KHz at realtime priority if the code is simple and no other CPU contention a single sample buffer (1ms) might even be possible, which is at worst 1ms extra latency over "immediate" but you would have to test. You then leave it up to the hardware and its drivers to handle the precise timing (e.g. make sure each output sample is "exactly" 1ms to the accuracy the vendor claims).
This basically means your code only has to be accurate to 1ms in worst case, rather than trying to persue somthing far smaller than the OS really supports such as microsecond accuracy.
As long as you are able to queue a new buffer before the hardware used up the previous buffer, it will be able to run at the desired frequency without issue (to use audio as an example again, while the tolerated latencies are often much higher and thus the buffers as well, if you overload the CPU you can still sometimes hear auidble glitches where an application didnt queue up new raw audio in time).
With careful timing you might even be able to get down to a fraction of a millisecond by waiting to process and queue your next sample as long as possible (e.g. if you need to reduce latency between input and output), but remember that the closer you cut it the more you risk submitting it too late.

Switching an image at specific frequencies c++

I am currently developing a stimuli provider for the brain's visual cortex as a part of a university project. The program is to (preferably) be written in c++, utilising visual studio and OpenCV. The way it is supposed to work is that the program creates a number of threads, accordingly to the amount of different frequencies, each running a timer for their respective frequency.
The code looks like this so far:
void timerThread(void *param) {
t *args = (t*)param;
int id = args->data1;
float freq = args->data2;
unsigned long period = round((double)1000 / (double)freq)-1;
while (true) {
Sleep(period);
show[id] = 1;
Sleep(period);
show[id] = 0;
}
}
It seems to work okay for some of the frequencies, but others vary quite a lot in frame rate. I have tried to look into creating my own timing function, similar to what is done in Arduino's "blinkWithoutDelay" function, though this worked very badly. Also, I have tried with the waitKey() function, this worked quite like the Sleep() function used now.
Any help would be greatly appreciated!

You should use timers instead of "sleep" to fix this, as sometimes the loop may take more or less time to complete.
Restart the timer at the start of the loop and take its value right before the reset- this'll give you the time it took for the loop to complete.
If this time is greater than the "period" value, then it means you're late, and you need to execute right away (and even lower the period for the next loop).
Otherwise, if it's lower, then it means you need to wait until it is greater.
I personally dislike sleep, and instead constantly restart the timer until it's greater.
I suggest looking into "fixed timestep" code, such as the one below. You'll need to put this snippet of code on every thread with varying values for the period (ns) and put your code where "doUpdates()" is.
If you need a "timer" library, since I don't know OpenCV, I recommend SFML (SFML's timer docs).
The following code is from here:
long int start = 0, end = 0;
double delta = 0;
double ns = 1000000.0 / 60.0; // Syncs updates at 60 per second (59 - 61)
while (!quit) {
start = timeAsMicro();
delta+=(double)(start - end) / ns; // You can skip dividing by ns here and do "delta >= ns" below instead //
end = start;
while (delta >= 1.0) {
doUpdates();
delta-=1.0;
}
}
Please mind the fact that in this code, the timer is never reset.
(This may not be completely accurate but is the best assumption I can make to fix your problem given the code you've presented)

Why does Sleep() slow down subsequent code for 40ms?

I originally asked about this at coderanch.com, so if you've tried to assist me there, thanks, and don't feel obliged to repeat the effort. coderanch.com is mostly a Java community, though, and this appears (after some research) to really be a Windows question, so my colleagues there and I thought this might be a more appropriate place to look for help.
I have written a short program that either spins on the Windows performance counter until 33ms have passed, or else calls Sleep(33). The former exhibits no unexpected effects, but the latter appears to (inconsistently) slow subsequent processing for about 40ms (either that, or it has some effect on the values returned from the performance counter for that long). After the spin or Sleep(), the program calls a routine, runInPlace(), that spins for 2ms, counting the number of times it queries the performance counter, and returning that number.
When the initial 33ms delay is done by spinning, the number of iterations of runInPlace() tends to be (on my Windows 10, XPS-8700) about 250,000. It varies, probably due to other system overhead, but it varies smoothing around 250,000.
Now, when the initial delay is done by calling Sleep(), something strange happens. A lot of the calls to runInPlace() return a number near 250,000, but quite a few of them return a number near 50,000. Again, the range varies around 50,000, fairly smoothly. But, it is clearly averaging one or the other, with nearly no returns anywhere between 80,000 and 150,000. If I call runInPlace() 100 times after each delay, instead of just once, it never returns a number of iterations in the smaller range after the 20th call. As runInPlace() runs for 2ms, this means the behavior I'm observing disappears after 40ms. If I have runInPlace() run for 4ms instead of 2ms, it never returns a number of iterations in the smaller range after the 10th call, so, again, the behavior disappears after 40ms (likewise if have runInPlace() run for only 1ms; the behavior disappears after the 40th call).
Here's my code:
#include "stdafx.h"
#include "Windows.h"
int runInPlace(int msDelay)
{
LARGE_INTEGER t0, t1;
int n = 0;
QueryPerformanceCounter(&t0);
do
{
QueryPerformanceCounter(&t1);
n++;
} while (t1.QuadPart - t0.QuadPart < msDelay);
return n;
}
int _tmain(int argc, _TCHAR* argv[])
{
LARGE_INTEGER t0, t1;
LARGE_INTEGER frequency;
int n;
QueryPerformanceFrequency(&frequency);
int msDelay = 2 * frequency.QuadPart / 1000;
int spinDelay = 33 * frequency.QuadPart / 1000;
for (int i = 0; i < 100; i++)
{
if (argc > 1)
Sleep(33);
else
{
QueryPerformanceCounter(&t0);
do
{
QueryPerformanceCounter(&t1);
} while (t1.QuadPart - t0.QuadPart < spinDelay);
}
n = runInPlace(msDelay);
printf("%d \n", n);
}
getchar();
return 0;
}
Here's some output typical of what I get when using Sleep() for the delay:
56116
248936
53659
34311
233488
54921
47904
45765
31454
55633
55870
55607
32363
219810
211400
216358
274039
244635
152282
151779
43057
37442
251658
53813
56237
259858
252275
251099
And here's some output typical of what I get when I spin to create the delay:
276461
280869
276215
280850
188066
280666
281139
280904
277886
279250
244671
240599
279697
280844
159246
271938
263632
260892
238902
255570
265652
274005
273604
150640
279153
281146
280845
248277
Can anyone help me understand this behavior? (Note, I have tried this program, compiled with Visual C++ 2010 Express, on five computers. It only shows this behavior on the two fastest machines I have.)

This sounds like it is due to the reduced clock speed that the CPU will run at when the computer is not busy (SpeedStep). When the computer is idle (like in a sleep) the clock speed will drop to reduce power consumption. On newer CPUs this can be 35% or less of the listed clock speed. Once the computer gets busy again there is a small delay before the CPU will speed up again.
You can turn off this feature (either in the BIOS or by changing the "Minimum processor state" setting under "Processor power management" in the advanced settings of your power plan to 100%.

Besides what #1201ProgramAlarm said (which may very well be, modern processors are extremely fond of downclocking whenever they can), it may also be a cache warming up problem.
When you ask to sleep for a while the scheduler typically schedules another thread/process for the next CPU time quantum, which means that the caches (instruction cache, data cache, TLB, branch predictor data, ...) relative to your process are going to be "cold" again when your code regains the CPU.

Is event recording on time-sensitive possible?

Basic Question
Is there any way to event recording an playback within a time-sensitive (framerate independent) system?
Any help - including a simple "No sorry it is impossible" - would be greatly appreciated. I have spent almost 20 hours working on this over the past few weekends and am driving myself crazy.
Full Details
This is being currently aimed at a game but the libraries I'm writing are designed to be more general and this concept applies to more than just my C++ coding.
I have some code that looks functionally similar this... (it is written in C++0x but I'm taking some liberties to make it more compact)
void InputThread()
{
InputAxisReturn AxisState[IA_SIZE];
while (Continue)
{
Threading()->EventWait(InputEvent);
Threading()->EventReset(InputEvent);
pInput->GetChangedAxis(AxisState);
//REF ALPHA
if (AxisState[IA_GAMEPAD_0_X].Changed)
{
X_Axis = AxisState[IA_GAMEPAD_0_X].Value;
}
}
}
And I have a separate thread that looks like this...
//REF BETA
while (Continue)
{
//Is there a message to process?
StandardWindowsPeekMessageProcessing();
//GetElapsedTime() returns a float of time in seconds since its last call
UpdateAll(LoopTimer.GetElapsedTime());
}
Now I'd like to record input events for playback for testing and some limited replay functionality.
I can easily record the events with precision timing by simply inserting the following code where I marked //REF ALPHA
//REF ALPHA
EventRecordings.pushback(EventRecording(TimeSinceRecordingBegan, AxisState));
The real issue is playing these back. My LoopTimer is extremely high precision using the High Performance Counter (QueryPreformanceCounter). This means that it is nearly impossible to hit the same time difference using code like below in place of //REF BETA
// REF BETA
NextEvent = EventRecordings.pop_back()
Time TimeSincePlaybackBegan;
while (Continue)
{
//Is there a message to process?
StandardWindowsPeekMessageProcessing();
//Did we reach the next event?
if (TimeSincePlaybackBegan >= NextEvent.TimeSinceRecordingBegan)
{
if (NextEvent.AxisState[IA_GAMEPAD_0_X].Changed)
{
X_Axis = NextEvent.AxisState[IA_GAMEPAD_0_X].Value;
}
NextEvent = EventRecordings.pop_back();
}
//GetElapsedTime() returns a float of time in seconds since its last call
Time elapsed = LoopTimer.GetElapsedTime()
UpdateAll(elapsed);
TimeSincePlabackBegan += elapsed;
}
The issue with this approach is that you will almost never hit the exact same time so you will have a few microseconds where the playback doesn't match the recording.
I also tried event snapping. Kind of a confusing term but basically if the TimeSincePlaybackBegan > NextEvent.TimeSinceRecordingBegan then TimeSincePlaybackBegan = NextEvent.TimeSinceRecordingBegan and ElapsedTime was altered to suit.
It had some interesting side effects which you would expect (like some slowdown) but it unfortunately still resulted in the playback de-synchronizing.
For some more background - and possibly a reason why my time snapping approach didn't work - I'm using BulletPhysics somewhere down that UpdateAll call. Kind of like this...
void Update(float diff)
{
static const float m_FixedTimeStep = 0.005f;
static const uint32 MaxSteps = 200;
//Updates the fps
cWorldInternal::Update(diff);
if (diff > MaxSteps * m_FixedTimeStep)
{
Warning("cBulletWorld::Update() diff > MaxTimestep. Determinism will be lost");
}
pBulletWorld->stepSimulation(diff, MaxSteps, m_FixedTimeStep);
}
But I also tried with pBulletWorkd->stepSimulation(diff, 0, 0) which according to http://www.bulletphysics.org/mediawiki-1.5.8/index.php/Stepping_the_World should have done the trick but still with no avail.

Answering my own question for anyone else who stumbles upon this.
Basically if you want deterministic recording and playback you need to lock the frame-rate. If the system cannot handle the frame-rate you must introduce slowdown or risk dsyncronization.
After two weeks of additional research I've decided it is just not possible due to floating point inadequacies and the fact that floating point numbers are not necessarily associative.
The only solution to have a deterministic engine that relies on floating point numbers is to have a stable and fixed frame-rate. Any change in the frame-rate will - across a long term - result in the playback becoming desynchronized.

ASCII DOS Games - Rendering methods

I'm writing an old school ASCII DOS-Prompt game. Honestly I'm trying to emulate ZZT to learn more about this brand of game design (Even if it is antiquated)
I'm doing well, got my full-screen text mode to work and I can create worlds and move around without problems BUT I cannot find a decent timing method for my renders.
I know my rendering and pre-rendering code is fast because if I don't add any delay()s or (clock()-renderBegin)/CLK_TCK checks from time.h the renders are blazingly fast.
I don't want to use delay() because it is to my knowledge platform specific and on top of that I can't run any code while it delays (Like user input and processing). So I decided to do something like this:
do {
if(kbhit()) {
input = getch();
processInput(input);
}
if(clock()/CLOCKS_PER_SEC-renderTimer/CLOCKS_PER_SEC > RenderInterval) {
renderTimer = clock();
render();
ballLogic();
}
}while(input != 'p');
Which should in "theory" work just fine. The problem is that when I run this code (setting the RenderInterval to 0.0333 or 30fps) I don't get ANYWHERE close to 30fps, I get more like 18 at max.
I thought maybe I'd try setting the RenderInterval to 0.0 to see if the performance kicked up... it did not. I was (with a RenderInterval of 0.0) getting at max ~18-20fps.
I though maybe since I'm continuously calling all these clock() and "divide this by that" methods I was slowing the CPU down something scary, but when I took the render and ballLogic calls out of the if statement's brackets and set RenderInterval to 0.0 I get, again, blazingly fast renders.
This doesn't make sence to me since if I left the if check in, shouldn't it run just as slow? I mean it still has to do all the calculations
BTW I'm compiling with Borland's Turbo C++ V1.01

The best gaming experience is usually achieved by synchronizing with the vertical retrace of the monitor. In addition to providing timing, this will also make the game run smoother on the screen, at least if you have a CRT monitor connected to the computer.
In 80x25 text mode, the vertical retrace (on VGA) occurs 70 times/second. I don't remember if the frequency was the same on EGA/CGA, but am pretty sure that it was 50 Hz on Hercules and MDA. By measuring the duration of, say, 20 frames, you should have a sufficiently good estimate of what frequency you are dealing with.
Let the main loop be someting like:
while (playing) {
do whatever needs to be done for this particular frame
VSync();
}
... /* snip */
/* Wait for vertical retrace */
void VSync() {
while((inp(0x3DA) & 0x08));
while(!(inp(0x3DA) & 0x08));
}

clock()-renderTimer > RenderInterval * CLOCKS_PER_SEC
would compute a bit faster, possibly even faster if you pre-compute the RenderInterval * CLOCKS_PER_SEC part.

I figured out why it wasn't rendering right away, the timer that I created is fine the problem is that the actual clock_t is only accurate to .054547XXX or so and so I could only render at 18fps. The way I would fix this is by using a more accurate clock... which is a whole other story

What about this: you are substracting from x (=clock()) y (=renderTimer). Both x and y are being divided by CLOCKS_PER_SEC:
clock()/CLOCKS_PER_SEC-renderTimer/CLOCKS_PER_SEC > RenderInterval
Wouldn't it be mor efficiente to write:
( clock() - renderTimer ) > RenderInterval
The very first problem I saw with the division was that you're not going to get a real number from it, since it happens between two long ints. The secons problem is that it is more efficiente to multiply RenderInterval * CLOCKS_PER_SEC and this way get rid of it, simplifying the operation.
Adding the brackets gives more legibility to it. And maybe by simplifying this phormula you will get easier what's going wrong.

As you've spotted with your most recent question, you're limited by CLOCKS_PER_SEC which is only about 18. You get one frame per discrete value of clock, which is why you're limited to 18fps.
You could use the screen vertical blanking interval for timing, it's traditional for games as it avoids "tearing" (where half the screen shows one frame and half shows another)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js