Posix Timer on SCHED_RR Thread is using 100% CPU - c++

I have the following code snippet:
#include <iostream>
#include <thread>
#include <cstdint>
#include <pthread.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>

void func15ns() { /* dummy blocking function, takes ~15 ns */ }

int main() {
    std::thread rr_thread([](){
        struct sched_param params = {5};
        pthread_setschedparam(pthread_self(), SCHED_RR, &params);

        struct itimerspec ts;
        struct epoll_event ev;
        int tfd, epfd;
        uint64_t missed;

        ts.it_interval.tv_sec = 0;
        ts.it_interval.tv_nsec = 0;
        ts.it_value.tv_sec = 0;
        ts.it_value.tv_nsec = 20000; // 50 kHz timer

        tfd = timerfd_create(CLOCK_MONOTONIC, 0);
        timerfd_settime(tfd, 0, &ts, NULL);

        epfd = epoll_create(1);
        ev.events = EPOLLIN;
        epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

        while (true) {
            epoll_wait(epfd, &ev, 1, -1); // wait forever for the timer
            read(tfd, &missed, sizeof(missed));
            // Here I have a blocking function (dummy in this example) which
            // takes on average 15 ns to execute, less than the timer period anyway
            func15ns();
        }
    });
    rr_thread.join();
}
I have a POSIX thread using the SCHED_RR policy, and on this thread a POSIX timer is running with a timeout of 20000 ns = 50 kHz = 50000 ticks/sec.
After the timer fires, I execute a function that takes roughly 15 ns, i.e. less than the timer period, but this doesn't really matter.
When I run this I get 100% CPU usage and the whole system becomes slow, but I don't understand why this is happening, and some things are confusing:

Why 100% CPU usage? The thread is supposed to be sleeping while waiting for the timer to fire, so other tasks can be scheduled in the meantime, right? Even if this is a high-priority thread.
I checked the number of context switches using pidstat, and it is very small, close to 0, both voluntary and involuntary ones. Is this normal? While waiting for the timer to fire, the scheduler should schedule other tasks, right? I should see at least 20000 * 2 context switches / sec.

As presented, your program does not behave as you describe, because you program the timer as a one-shot, not a repeating timer. For a timer that fires every 20000 ns, you want to set a 20000-ns interval:
ts.it_interval.tv_nsec = 20000;
Having modified that, I get a program that produces heavy load on one core.
Why 100% CPU usage since the thread is supposed to be sleeping while waiting for the timer to fire, so other tasks can be scheduled in theory, right? Even if this is a high-priority thread.
Sure, your thread blocks in epoll_wait() to await timer ticks, provided it in fact manages to loop back there before the timer ticks again. On my machine, your program consumes only about 30% of one core, which seems to confirm that such blocking does indeed happen. That you see 100% CPU use suggests that my computer runs the program more efficiently than yours does, for whatever reason.
But you have to appreciate that the load is very heavy. You are asking the machine to perform all the processing of the timer itself, the epoll call, the read(), and func15ns() once every 20000 ns. Yes, whatever time is left over, if any, is available to be scheduled for another task, but the task swap itself takes a bit more time again. 20000 ns is not much time: consider that just fetching a word from main memory costs about 100 ns (though reading one from cache is of course faster).
In particular, do not neglect the work other than func15ns(). If the latter indeed takes only 15 ns to run, then it's the least of your worries. You're performing two system calls per tick, and those are expensive. Just how expensive depends on many factors, but consider that removing the epoll_wait() call alone reduces the load for me from 30% to 25% of a core (and note that the whole epoll setup is superfluous here, because simply letting the read() block serves the purpose).
I checked using pidstat the number of context switches and it seems that it's very small, close to 0, both voluntary and involuntary ones. Is this normal? While waiting for the timer to fire the scheduler should schedule other tasks, right? I should see at least 20000 * 2 context switches / sec.
You're occupying a full CPU with a high-priority task, so why would you expect switching?
On the other hand, I'm also observing a low number of context switches for the process running your (modified) program, even though it's occupying only 25% of a core. I'm not prepared at the moment to reason about why that is.

Related

I'm looking to improve or request my current delay / sleep method. c++

Currently I am coding a project that requires precise delay times over a number of computers. This is the code I am using; I found it on a forum:
{
    LONGLONG timerResolution;
    LONGLONG wantedTime;
    LONGLONG currentTime;
    // "ms" is the requested delay in milliseconds, passed in by the
    // enclosing function (not shown in the original post)
    QueryPerformanceFrequency((LARGE_INTEGER*)&timerResolution);
    timerResolution /= 1000;
    QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);
    wantedTime = currentTime / timerResolution + ms;
    currentTime = 0;
    while (currentTime < wantedTime)
    {
        QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);
        currentTime /= timerResolution;
    }
}
Basically, the issue I am having is that this uses a lot of CPU, around 16-20%, when I start to call the function. The usual Sleep() uses zero CPU, but it is extremely inaccurate. From what I have read on multiple forums, that's the trade-off: you trade accuracy for CPU usage. But I thought I'd better ask before I settle for this sleep method.
The reason it's using 15-20% CPU is likely that it's using 100% of one core, as there is nothing in the loop to slow it down.
In general, this is a "hard" problem to solve, as PCs (more specifically, the OSes running on them) are generally not made for running real-time applications. If that is absolutely required, you should look into real-time kernels and OSes.
For this reason, the guarantee usually made about sleep times is that the system will sleep for at least the specified amount of time.
If you are running Linux you could try the nanosleep function (http://man7.org/linux/man-pages/man2/nanosleep.2.html), though I don't have any experience with it.
Alternatively you could go with a hybrid approach where you use sleeps for long delays, but switch to polling when it's almost time:
#include <thread>
#include <chrono>

using namespace std::chrono_literals;
...
wantedTime = currentTime / timerResolution + ms;
currentTime = 0;
while (currentTime < wantedTime)
{
    QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);
    currentTime /= timerResolution;
    if (wantedTime - currentTime > 100) // if more than 100 ms remain
    {
        // Sleep for a value significantly lower than the remaining 100 ms,
        // to ensure that we don't "oversleep"
        std::this_thread::sleep_for(50ms);
    }
}
Now this is a bit race-condition prone, as it assumes that the OS will hand control back to the program within 50 ms of sleep_for finishing. To further guard against that, you could reduce the sleep (to, say, 1 ms).
You can set the Windows timer resolution to its minimum (usually 1 ms) to make Sleep() accurate to about 1 ms. By default it is only accurate to about 15 ms. See the Sleep() documentation.
Note that your execution can still be delayed if other programs are consuming CPU time, but that could also happen if you were waiting on a timer.
#include <windows.h>
#include <timeapi.h> // link against winmm.lib

// Sleep() takes ~15 ms (or whatever the default resolution is)
Sleep(1);

TIMECAPS caps_;
timeGetDevCaps(&caps_, sizeof(caps_));
timeBeginPeriod(caps_.wPeriodMin);

// Sleep() now takes ~1 ms
Sleep(1);

timeEndPeriod(caps_.wPeriodMin);

C++ How to make precise frame rate limit?

I'm trying to create a game in C++ and I want to set an FPS limit, but I always get more or fewer FPS than I want. When I look at games that have an FPS limit, the framerate is always precise. I tried using Sleep() and std::this_thread::sleep_for()/sleep_until(). For example Sleep(0.01 - deltaTime) to get 100 fps, but I ended up with roughly 90 fps either way.
How do these games handle FPS so precisely when sleeping isn't precise?
I know I can use an infinite loop that just checks whether enough time has passed, but that uses the full power of the CPU. I want to reduce CPU usage with this limit, without VSync.
Yes, sleep is usually inaccurate. That is why you sleep for less than the actual time remaining in the frame. For example, if you need 5 more milliseconds to finish the frame, sleep for 4 milliseconds. After the sleep, simply spin for the rest of the frame. Something like this (ConvertToMilliseconds() stands in for whatever unit conversion your timing code needs):
float TimeRemaining = NextFrameTime - GetCurrentTime();
Sleep(ConvertToMilliseconds(TimeRemaining) - 1); // sleep most of the way
while (GetCurrentTime() < NextFrameTime) {}      // spin the rest
Edit: as stated in another answer, timeBeginPeriod() should be called to increase the accuracy of Sleep(). Also, from what I've read, Windows will automatically call timeEndPeriod() when your process exits if you don't before then.
You could record the time point when you start, add a fixed duration to it, and sleep until the calculated time point at the end (or beginning) of every loop. Example:
#include <chrono>
#include <cstdint>
#include <iostream>
#include <ratio>
#include <thread>

template<std::intmax_t FPS>
class frame_rater {
public:
    frame_rater() :                 // initialize the object keeping the pace
        time_between_frames{1},     // std::ratio<1, FPS> seconds
        tp{std::chrono::steady_clock::now()}
    {}

    void sleep() {
        // add to time point
        tp += time_between_frames;

        // and sleep until that time point
        std::this_thread::sleep_until(tp);
    }

private:
    // a duration with a length of 1/FPS seconds
    std::chrono::duration<double, std::ratio<1, FPS>> time_between_frames;

    // the time point we'll add to in every loop
    std::chrono::time_point<std::chrono::steady_clock, decltype(time_between_frames)> tp;
};

// this should print ~10 times per second pretty accurately
int main() {
    frame_rater<10> fr; // 10 FPS
    while(true) {
        std::cout << "Hello world\n";
        fr.sleep(); // sleep for any time remaining
    }
}
The accepted answer sounds really bad: it would not be accurate, and it would burn the CPU!
Sleep() is not accurate unless you ask it to be (by default it is accurate to about 15 ms, meaning that if you tell it to sleep 1 ms it may sleep 15 ms).
You can do this with the Win32 timeBeginPeriod and timeEndPeriod functions.
Check MSDN for details: https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
(I would comment on the accepted answer, but I don't have 50 reputation yet.)
Be very careful when implementing any wait that is based on scheduler sleep.
Most OS schedulers have higher turn-around latency for a wait with no well-defined interval or signal to bring the thread back into the ready-to-run state.
Sleeping isn't inaccurate per se; you're just approaching the problem all wrong. If you have access to something like DXGI's waitable swapchain, you can synchronize to the DWM's present queue and get really reliable low-latency timing.
You don't need to busy-wait to get accurate timing: a waitable timer will give you a sync object with which to reschedule your thread.
Whatever you do, do not use the currently accepted answer in production code. There's an edge case here you WANT TO AVOID, where Sleep(0) does not yield CPU time to higher-priority threads. I've seen so many game devs try Sleep(0), and it's going to cause you major problems.
Use a timer.
Some OSes provide special functions for this. For example, on Windows you can use SetTimer and handle its WM_TIMER messages.
Then calculate the frequency of the timer: 100 fps means the timer must fire an event every 0.01 seconds.
In the event handler for this timer event, do your rendering.
In case the rendering is slower than the desired frequency, use a synchronization flag (e.g. an OpenGL sync object) and discard the timer event if the previous rendering is not yet complete.
You may set a const fps variable to your desired frame rate, then update your game only if the elapsed time since the last update is equal to or greater than 1 / desired_fps seconds. Note that with an integer fps, 1 / fps truncates to zero, so do the division in floating point.
This will probably work.
Example:
const /*or constexpr*/ int fps{60};

// then in the update loop
while(running)
{
    // update the game timer
    timer->update();

    // check whether a frame's worth of time has elapsed
    if(timer->ElapsedTime() >= 1.0 / fps)
    {
        // do your updates and THEN render
    }
}

High CPU usage while using poll system call to wait on fds

I have this unique problem where the poll system call used in my code finds the fds it waits on readable (POLLIN) every millisecond. This is causing high CPU usage. I have supplied a timeout of 100 milliseconds and it seems to be of no use. Can anyone suggest an alternative?
for (;;) {
    ACE_Time_Value doWork(0, 20000);
    // Causes low throughput; added to decrease CPU usage.
    // On removing this we see high CPU, but throughput is achieved.
    ACE_OS::sleep(doWork);
    ...
    if ((exitCode = fxDoWork()) < 0) {
        break;
    }
}

fxDoWork()
{
    ACE_Time_Value selectTime;
    selectTime.set(0, 100000);
    ...
    // POLLIN happens every millisecond; the timeout is of no use here
    ACE_INT32 waitResult = ACE_OS::poll(myPollfds, eventCount, &selectTime);
    ...
}
It sounds like you want to reduce CPU usage by waiting until either enough data has accumulated OR a specific timeout has elapsed, right? If so, you can use recvmmsg(): http://man7.org/linux/man-pages/man2/recvmmsg.2.html
The recvmmsg() system call is an extension of recvmsg(2) that allows
the caller to receive multiple messages from a socket using a single
system call. (This has performance benefits for some applications.)
A further extension over recvmsg(2) is support for a timeout on the
receive operation.

Does a while loop always take full CPU usage?

I need to create a server-side game loop; the problem is how to limit the loop's CPU usage.
In my experience, a busy loop always takes as much CPU as it can get. But I am reading the code of SDL (Simple DirectMedia Layer), and it has a function SDL_Delay(Uint32 ms) containing a while loop. Does it take max CPU usage? If not, why not?
https://github.com/eddieringle/SDL/blob/master/src/timer/unix/SDL_systimer.c#L137-158
do {
    errno = 0;
#if HAVE_NANOSLEEP
    tv.tv_sec = elapsed.tv_sec;
    tv.tv_nsec = elapsed.tv_nsec;
    was_error = nanosleep(&tv, &elapsed);
#else
    /* Calculate the time interval left (in case of interrupt) */
    now = SDL_GetTicks();
    elapsed = (now - then);
    then = now;
    if (elapsed >= ms) {
        break;
    }
    ms -= elapsed;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    was_error = select(0, NULL, NULL, NULL, &tv);
#endif /* HAVE_NANOSLEEP */
} while (was_error && (errno == EINTR));
This code uses select() for a timeout. select() normally takes file descriptors and makes the caller wait until an I/O event occurs on one of them. It also takes a timeout argument for the maximum time to wait. Here no descriptor sets are passed (nfds is 0 and the sets are NULL), so no event can ever occur, and the call always returns when the timeout is reached.
The select(3) that you get from the C library is a wrapper around the select(2) system call, which means calling select(3) eventually gets you into the kernel. The kernel then doesn't schedule the process until an I/O event occurs or the timeout is reached, so the process is not using the CPU while waiting.
Obviously, the jump into the kernel and process scheduling introduce delays, so if you need very low latency (nanoseconds) you should busy-wait instead.
That loop won't take up all the CPU. It uses one of two functions to tell the operating system to pause the thread for a given amount of time, letting other threads use the CPU:
// First call - if HAVE_NANOSLEEP is defined:
was_error = nanosleep(&tv, &elapsed);

// Second call - fallback without nanosleep:
was_error = select(0, NULL, NULL, NULL, &tv);
While the thread is blocked in SDL_Delay, it yields the CPU to other tasks. If the delay is long enough, the operating system will even put the CPU into an idle or halt mode if there is no other work to do. Note that this won't work well if the delay time isn't at least 20 milliseconds or so.
However, this is usually not the right way to do whatever it is you are trying to do. What is your outer problem? Why does your game loop never finish what needs to be done at the current time, so that it would then wait for something to happen before it has more work? How can it always have an infinite amount of work to do immediately?

Make select based loop as responsive as possible

This thread will be very responsive to network activity but can be guaranteed to process the message queue only as often as 100 times a second. I can keep reducing the timeout but after a certain point I will be busy-waiting and chewing up CPU. Is it true that this solution is about as good as I'll get without switching to another method?
// semi-pseudocode
while (1) {
    process_thread_message_queue(); // function returns near-instantly

    struct timeval t;
    t.tv_sec = 0;
    t.tv_usec = 10 * 1000; // 10 ms = 0.01 s

    // see if there are incoming packets during the next 1/100 sec
    if (select(n, &fdset, 0, 0, &t))
    {
        ... // respond with more packets or processing
    }
}
It depends on what your OS provides for you. On Windows you can wait for a thread message and a bunch of handles simultaneously using MsgWaitForMultipleObjectsEx, which solves your problem. Other OSes should have something similar.