I have discovered that SwapBuffers in OpenGL will busy-wait as long as the graphics card isn't done with its rendering or if it's waiting on V-Sync.
This is a problem for me because I don't want to waste 100% of a CPU core just waiting for the card to finish. I'm not writing a game, so I can't use those CPU cycles for anything more productive; I just want to yield them to some other process in the operating system.
I've found callback-functions such as glutTimerFunc and glutIdleFunc that could work for me, but I don't want to use glut. Still, glut must in some way use the normal gl functions to do this, right?
Is there any function such as "glReadyToSwap" or similar? In that case I could check it every millisecond or so and decide whether to wait a while longer or do the swap. I could also imagine skipping SwapBuffers altogether and writing my own similar function that doesn't busy-wait, if someone could point me in the right direction.
SwapBuffers is not busy waiting; it just blocks your thread in the driver context, which makes Windows report the CPU usage incorrectly: Windows estimates CPU usage from how much CPU time the idle process gets plus how much time programs spend outside driver context. SwapBuffers blocks in driver context, and your program obviously takes that CPU time away from the idle process.
But your CPU is doing literally nothing during that time; the scheduler happily hands the time to other processes. The idle process, OTOH, does nothing but immediately yield its time to the rest of the system, so the scheduler jumps right back into your process, which is still blocked in the driver, and Windows counts that as "clogging the CPU". If you measured the actual power consumption or heat output, a simple OpenGL program would stay rather low.
This irritating behaviour is actually an OpenGL FAQ!
Just create additional threads for parallel data processing. Keep OpenGL in one thread and the data processing in the other. If you want to bring the reported CPU usage down, adding a Sleep(0) or Sleep(1) after SwapBuffers will do the trick: the Sleep(1) makes your process block for a little while in user context, so the idle process gets more time, which evens out the numbers. If you don't want to sleep, you may do the following:
const float time_margin = ...;           // some safety margin, e.g. a millisecond or two
float display_refresh_period;            // something like 1./60. or so

void render() {
    float rendertime_start = get_time();

    render_scene();
    glFinish();                          // wait until the GPU has actually finished

    float rendertime_finish = get_time();
    float time_to_finish = rendertime_finish - rendertime_start;

    // Sleep away the rest of the refresh period, minus the safety margin,
    // so SwapBuffers has (almost) nothing left to block on.
    float time_rest = display_refresh_period
                    - fmod(time_to_finish, display_refresh_period)
                    - time_margin;
    if (time_rest > 0)
        sleep(time_rest);

    SwapBuffers();
}
In my programs I use this kind of timing, but for another reason: I let SwapBuffers block without any helper Sleeps, but I give some worker threads roughly that much time to do work on the GPU through a shared context (like updating textures), and I have the garbage collector running. It's not strictly necessary to time it exactly, but if the worker threads finish just before SwapBuffers returns, you can start rendering the next frame almost immediately, since most mutexes are already unlocked by then.
Though eglSwapBuffers does not busy-wait, a legitimate use for a non-blocking eglSwapBuffers is to have a more responsive GUI thread that can listen for user input or exit signals instead of waiting for OpenGL to finish swapping buffers. I have a solution to half of this problem. First, in your main loop, you buffer up your OpenGL commands to execute on your swapped-out buffer. Then you poll on a sync object to see if those commands have finished executing, and swap buffers once they have. Unfortunately, this solution only waits asynchronously for the commands to finish executing on the swapped-out buffer; it does not wait asynchronously for vsync. Here is the code:
void process_gpu_stuff(struct gpu_context *gpu_context)
{
    int errnum = 0;

    switch (gpu_context->state) {
    case BUFFER_COMMANDS:
        glDeleteSync(gpu_context->sync_object);
        gpu_context->sync_object = 0;

        real_draw(gpu_context);
        glFlush();

        gpu_context->sync_object = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        if (0 == gpu_context->sync_object) {
            errnum = get_gl_error();
            break;
        }
        gpu_context->state = SWAP_BUFFERS;
        break;

    case SWAP_BUFFERS:
        /* Poll to see if the buffer is ready for swapping; if it is
         * not ready yet we can handle other events in the meantime. */
        switch (glClientWaitSync(gpu_context->sync_object, 0, 1000U)) {
        case GL_ALREADY_SIGNALED:
        case GL_CONDITION_SATISFIED:
            if (EGL_FALSE == eglSwapBuffers(display, surface)) {
                errnum = get_egl_error();
                break;
            }
            gpu_context->state = BUFFER_COMMANDS;
            break;

        case GL_TIMEOUT_EXPIRED:
            /* Do nothing. */
            break;

        case GL_WAIT_FAILED:
            errnum = get_gl_error();
            break;
        }
        break;
    }
}
The popular answer here is wrong. Windows is not reporting the CPU usage "wrongly": OpenGL with vsync on, even while rendering a blank screen, really is burning 100% of one thread of your CPU (you can check your CPU temperatures).
But the solution is simple: just call DwmFlush() before or after SwapBuffers.
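A minimal sketch of what that looks like in a plain Win32/WGL render loop; render_scene and hdc stand in for your own drawing code and device context:

#include <windows.h>
#include <dwmapi.h>      // DwmFlush(); link against dwmapi.lib

// ... inside the render loop ...
render_scene();          // your own drawing code
SwapBuffers(hdc);        // present the back buffer
DwmFlush();              // block until the next DWM composition pass instead of spinning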
Related
This is most noticeable in graphics programs. Let's take the OpenGL base program (a spinning triangle) as an example.
Whenever I run it normally, with no other apps open in the background, it spins slowly, but when I run a game in the background it starts spinning like mad. It seems as if the computer doesn't allocate enough memory for the program to run at maximum speed, and, paradoxically, doing resource-consuming stuff accelerates it because it gets more memory.
The only way I've found to partially fix this is to put a higher value in the Sleep function, but this doesn't fix it completely, nor is it a consistent solution, as other problems may arise from it. Is there a good way to fix this and make the program run consistently?
This mostly happens because you are not capping your FPS, so nothing prevents your render loop from being called as often as possible, and your logic (which controls the rotation) executes in the same loop.
What happens is that most GPUs have power management, so they keep their frequencies low when there's no demand; opening an expensive game makes your GPU bump up its power, thus rendering a lot faster and thus calling your rendering loop more often.
To prevent this (and to separate logic time from rendering time in general) you must control the frame rate and use the elapsed time as the input for your rotation, something like:
auto elapsed = now();                    // time stamp of the previous frame
while (!exit) {
    render();
    auto delta = now() - elapsed;
    if (delta < TIME_PER_FRAME)
        delay(TIME_PER_FRAME - delta);   // cap the frame rate
    elapsed = now();
    updateLogic(delta);                  // rotation advances by elapsed time, not by loop count
}
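To make the "time as an input" part concrete, the rotation update inside updateLogic then scales by the elapsed time rather than by the number of loop iterations; a minimal sketch, where angle and angularSpeed are illustrative names and delta is assumed to be in seconds:

// Assuming delta is the frame time in seconds and angle is the triangle's rotation state:
void updateLogic(double delta) {
    const double angularSpeed = 90.0;   // degrees per second, regardless of frame rate
    angle += angularSpeed * delta;      // advance by elapsed time, not by iteration count
    if (angle >= 360.0)
        angle -= 360.0;
}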
For starters, you need to understand what's going on with your program. It has nothing to do with memory, and I don't see a reason to think about memory.
Opening other programs could make your CPU go faster because of the load (doubtful, but clearly more likely than memory allocation).
The other programs could be messing with some setting.
If you're using sleep(), signals can interrupt the call (hardly anyone looks at the return value; there's a reason it's unsigned int sleep(unsigned int) and not void sleep(unsigned int)).
If you can, don't use sleep. And if you're going to, check the result: sleep doesn't guarantee that the whole time has passed (IMHO bad design, but I'm not a POSIX fan).
The usual approach would be to have your function called periodically as a callback. If you're going to do some sort of delay or sleep yourself, you should check that the required time has actually passed.
Given that you want your logic tied to the render, and that you use a function like sleep that can be interrupted (building on the other answer):
while (!exit) {
    auto startOfFrame = now();
    render();
    auto toDelay = startOfFrame + TIME_PER_FRAME - now();
    while (toDelay > 0) {
        delay(toDelay);                              // may return early if interrupted
        toDelay = startOfFrame + TIME_PER_FRAME - now();
    }
    updateLogic();
}
I have two threads:
GUI, which does the typical GUI stuff and manages a bunch of flags that affect the Processing thread
Processing, which handles realtime data on a 30Hz period forever
There are lots of examples of how to have one thread wait for another to finish, but none for how to make a temporary roadblock without killing the thread.
There's a function in my GUI thread that contains this:
Scene* scene = getSceneToFadeFrom();
scene->setSelected(false);
///TODO: wait until (!scene->processing)
fadeFrom = scene->dmx;
and one in my Processing thread that contains this while looping through a QList:
if(scene->getSelected())
{
scene->processing = true;
scene->run(); //updates scene->dmx
scene->processing = false;
}
If this were an embedded project on bare metal, I could use the global interrupt enable flag in place of scene->processing (invert the logic) and be done, which dedicates the entire CPU to that task at the expense of all others.
But because this is a desktop project with an operating system to play nice with, how can I achieve the same effect within this project? Basically, pause the GUI thread at that point until scene->processing == false (which it might be already) and guarantee that the Processing thread is actually running while the GUI thread waits for it.
And here's what I came up with. It was actually an XY problem. I'm surprised that I didn't think of this right away because I had already done something similar for deleting a Scene:
GUI thread:
//(sceneToReplace != 0) means there's something for Processing to do
sceneToReplace = getSceneToFadeFrom();
if(sceneToReplace)
{
sceneToReplace->setSelected(false);
}
Processing thread, same class:
if(sceneToReplace)
{
fadeFrom = sceneToReplace->dmx;
sceneToReplace = 0;
}
and I don't even need the processing flag anymore!
fadeFrom gets set a little later than in the original version, but it's not actually needed until then anyway.
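One detail worth adding to this approach: since sceneToReplace is written by the GUI thread and read and cleared by the Processing thread, it is safest to make it atomic so the hand-off is well defined. A minimal sketch, assuming a shared member of that name:

#include <atomic>

std::atomic<Scene*> sceneToReplace { nullptr };

// GUI thread:
Scene* candidate = getSceneToFadeFrom();
if (candidate) {
    candidate->setSelected(false);
    sceneToReplace.store(candidate, std::memory_order_release);
}

// Processing thread:
if (Scene* s = sceneToReplace.exchange(nullptr, std::memory_order_acquire)) {
    fadeFrom = s->dmx;
}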
Everything I've found so far regarding timers suggests that, at best, they offer 1ms resolution. QTimer's docs claim that's the best it can provide.
I understand that OSes like Windows are not real-time OSes, but I still want to ask this question in hopes that someone knows something that could help.
So, I'm writing an app that requires a function to be called at a fairly precise but arbitrary interval, say 60 times/sec (full range: 59-61Hz). That means I need it to be called, on average, every ~16.67ms. This part of the design can't change.
The best timing source I currently have is vsync. When I go off of that, it's pretty good. It's not ideal, because the monitor's frequency is not exactly what I need to call this function at, but it can be somewhat compensated for.
The kicker is that the accuracy I'm after, given the range, is more or less available with timers, but not the precision. I can get a 16ms timer to hit exactly 16ms ~97% of the time, and a 17ms timer to hit exactly 17ms ~97% of the time, but no API exists to get me 16.67ms?
Is what I'm looking for simply not possible?
Background: The project is called Phoenix. Essentially, it's a libretro frontend. Libretro "cores" are game console emulators encapsulated in individual shared libraries. The API function being called at a specific rate is retro_run(). Each call emulates a game frame and calls callbacks for audio, video and so on. In order to emulate at a console's native framerate, we must call retro_run() at exactly (or as close to) this rate, hence the timer.
You could write a loop that checks std::chrono::high_resolution_clock and calls std::this_thread::yield() until the right amount of time has elapsed. If the program needs to stay responsive while this is going on, do it in a separate thread from the one running the main loop.
Some example code:
http://en.cppreference.com/w/cpp/thread/yield
An alternative is to use QElapsedTimer, whose clock type on Windows is the performance counter. You will still need to check it from a loop, and will probably still want to yield within that loop. Example code: http://doc.qt.io/qt-4.8/qelapsedtimer.html
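A minimal sketch of such a spin-and-yield wait, using plain std::chrono and std::this_thread (the surrounding frame loop is only indicated in comments):

#include <chrono>
#include <thread>

// Spin, yielding the remainder of each time slice, until the target time arrives.
void waitUntil(std::chrono::high_resolution_clock::time_point target) {
    while (std::chrono::high_resolution_clock::now() < target)
        std::this_thread::yield();
}

// Usage: advance the deadline by ~16.67 ms each frame and wait the remainder out.
// auto next = std::chrono::high_resolution_clock::now();
// while (running) {
//     next += std::chrono::nanoseconds(16666667);   // ~1/60 s
//     retro_run();
//     waitUntil(next);
// }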
It is completely unnecessary to call retro_run at any highly controlled time in particular, as long as the average frame rate comes out right, and as long as your audio output buffers don't underflow.
First of all, you will likely have to measure real time using an audio-output-based timer. Ultimately, each retro_run produces a chunk of audio, and the audio buffer state with that chunk added is your timing reference: if you run early, the buffer will be too full; if you run late, the buffer will be too empty.
This error measure can be fed into a PI controller, whose output gives you the desired delay until the next invocation of retro_run. This will automatically ensure that your average rate and phase are correct. Any systematic latencies in getting retro_run active will be integrated away, etc.
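A minimal sketch of such a controller; the gains, the buffer fill query and the target value are illustrative and would have to be tuned against the real audio backend:

// error > 0 means the audio buffer is fuller than the target, i.e. we are
// producing frames too fast and should wait a bit longer before the next one.
double delayUntilNextRunMs(double bufferFillFrames, double targetFillFrames) {
    static double integral = 0.0;
    const double Kp = 0.05, Ki = 0.005;            // illustrative gains
    const double nominalPeriodMs = 1000.0 / 60.0;  // the core's native frame period

    double error = bufferFillFrames - targetFillFrames;
    integral += error;
    return nominalPeriodMs + Kp * error + Ki * integral;
}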
Secondly, you need a way of waking yourself up at the correct moment in time. Given a target time (in terms of a performance counter, for example) at which to call retro_run, you need a source of events that wakes your code up so that you can compare the time and call retro_run when necessary.
The simplest way of doing this would be to reimplement QCoreApplication::notify. You'll have a chance to retro_run prior to the delivery of every event, in every event loop, in every thread. Since system events might not otherwise come often enough, you'll also want to run a timer to provide a more dependable source of events. It doesn't matter what the events are: any kind of event is good for your purpose.
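A minimal sketch of that notify override; frameIsDue and runCore are hypothetical stand-ins for your own timing check and the retro_run call:

#include <QApplication>
#include <QEvent>

class FrameDrivingApplication : public QApplication {
public:
    using QApplication::QApplication;

    bool notify(QObject *receiver, QEvent *event) override {
        // Every event delivered anywhere in the process is a chance to run a frame.
        if (frameIsDue())   // hypothetical: compares a performance counter to the target time
            runCore();      // hypothetical: wraps the retro_run() call
        return QApplication::notify(receiver, event);
    }
};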
I'm not familiar with threading limitations of retro_run - perhaps you can run it in any one thread at a time. In such case, you'd want to run it on the next available thread in a pool, perhaps with the exception of the main thread. So, effectively, the events (including timer events) are used as energetically cheap sources of giving you execution context.
If you choose to have a thread dedicated to retro_run, it should be a high priority thread that simply blocks on a mutex. Whenever you're ready to run retro_run when a well-timed event comes, you unlock the mutex, and the thread should be scheduled right away, since it'll preempt most other threads - and certainly all threads in your process.
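A minimal sketch of that arrangement, using a QSemaphore rather than a bare mutex as the wake-up (runCore again stands in for the retro_run call):

#include <QThread>
#include <QSemaphore>
#include <atomic>

QSemaphore frameReady;                 // released by whichever thread caught the well-timed event
std::atomic<bool> quitting { false };

class CoreThread : public QThread {
    void run() override {
        while (!quitting.load()) {
            frameReady.acquire();      // blocks here without consuming CPU until woken
            if (quitting.load()) break;
            runCore();                 // hypothetical wrapper around retro_run()
        }
    }
};

// Elsewhere: coreThread->start(QThread::TimeCriticalPriority);
// and from the well-timed event handler: frameReady.release();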
OTOH, on a low core count system, the high priority thread is likely to preempt the main (gui) thread, so you might as well invoke retro_run directly from whatever thread got the well-timed event.
It might of course turn out that using events from arbitrary threads to wake up the dedicated thread introduces too much worst-case latency or too much latency spread - this will be system-specific and you may wish to collect runtime statistics, switch threading and event source strategies on the fly, and stick with the best one. The choices are:
1. retro_run in a dedicated thread waiting on a mutex, the unlock source being any thread with a well-timed event caught via notify,
2. retro_run in a dedicated thread waiting for a timer (or any other) event; events still caught via notify,
3. retro_run in the gui thread, triggered by the events delivered to the gui thread, still caught via notify,
4. any of the above, but using timer events only - note that you don't care which timer events they are, they don't need to come from your timer,
5. as in #4, but selective to your timer only.
My implementation, based on Lorehead's answer. All time variables are in ms.
It of course still needs a way to stop running, and I was also thinking about subtracting half the (running average) difference between timeElapsed and interval to make the average error +-n instead of +2n, where 2n is the average overshoot.
// Typical interval value: 1/60s ~= 16.67ms
void Looper::beginLoop( double interval ) {
QElapsedTimer timer;
int counter = 1;
int printEvery = 240;
int yieldCounter = 0;
double timeElapsed = 0.0;
forever {
if( timeElapsed > interval ) {
timer.start();
counter++;
if( counter % printEvery == 0 ) {
qDebug() << "Yield() ran" << yieldCounter << "times";
qDebug() << "timeElapsed =" << timeElapsed << "ms | interval =" << interval << "ms";
qDebug() << "Difference:" << timeElapsed - interval << " -- " << ( ( timeElapsed - interval ) / interval ) * 100.0 << "%";
}
yieldCounter = 0;
importantBlockingFunction();
// Reset the frame timer
timeElapsed = ( double )timer.nsecsElapsed() / 1000.0 / 1000.0;
}
timer.start();
// Running this just once means massive overhead from calling timer.start() so many times so quickly
for( int i = 0; i < 100; i++ ) {
yieldCounter++;
QThread::yieldCurrentThread();
}
timeElapsed += ( double )timer.nsecsElapsed() / 1000.0 / 1000.0;
}
}
I've run into an odd issue with some OpenCL code that I'm working on where every once in a blue moon, Windows TDR will kick in and reset the GPU. The offending kernel runs for only 150ms and will run thousands of times (over the course of many hours) before the TDR kills it off, so I'm certain that the kernel itself isn't to blame.
My concern is that once the TDR kicks in, the kernel dies and the program is stuck in an eternal state of limbo. From what I can tell the call to clFinish never returns.
Is there a way to detect if a kernel has died off so that it can be handled gracefully?
I managed to come up with a solution, although it's far from optimal.
I've modified the program so that the OpenCL processing is done in a separate thread. I created a watchdog variable shared between the parent thread and the processing thread. When the parent spawns the processing function as a thread, it sets the variable to the current time in milliseconds. When the processing thread finishes, it resets the watchdog variable to zero.
While the parent thread waits for the processing thread to finish, it keeps an eye on the watchdog timer. If the elapsed time exceeds a certain threshold, the program forcefully terminates itself without waiting for the processing thread to return.
This solution works whether or not Windows TDR is enabled. If TDR is enabled and the driver resets, the call to clFinish() never returns and the parent terminates once the watchdog timer trips. If TDR is not enabled, the runaway kernel will freeze the display, but once the watchdog timer trips, the parent terminates, ending the freeze.
Now that I have a watchdog set up, I simply wrapped my program in a script: if it terminates with an error (positive return code), the program is rerun.
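A minimal sketch of that watchdog arrangement; the helper names and the five-second threshold are illustrative:

#include <atomic>
#include <chrono>
#include <cstdlib>

std::atomic<long long> watchdogStartMs { 0 };   // 0 means "no OpenCL work in flight"

long long nowMs() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

// Parent thread, when spawning the worker:
//     watchdogStartMs = nowMs();
//     std::thread worker(processingThread);

void processingThread() {
    runOpenClWork();        // hypothetical: enqueue the kernel, clFinish(), read back results
    watchdogStartMs = 0;    // finished normally, disarm the watchdog
}

// Parent thread, while waiting for the worker:
//     long long started = watchdogStartMs.load();
//     if (started != 0 && nowMs() - started > 5000)   // clFinish appears to be stuck
//         std::exit(EXIT_FAILURE);                    // positive exit code -> wrapper script reruns us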
Ideally, you would get an error code from clFinish or from clWaitForEvents on the OpenCL event object generated when enqueuing the kernel. But since TDR resets the graphics driver, I don't think any OpenCL implementation will behave reliably at that point, meaning there is no recovery route.
Rather, disable TDR completely. It is only worthwhile when you are debugging code that gets stuck in an infinite loop and permanently keeps the GPU busy.
If you want to keep TDR but can change the code, then using some sort of thread sleep function to delay your code for a few milliseconds could also alleviate the problem, at the expense of processing speed. This gives the graphics card a chance to respond to display rendering commands, so TDR is not triggered.
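A minimal sketch of that idea, assuming the work can be split into several smaller enqueues; queue, kernel, offsets, chunk_size and num_chunks are placeholders for your own setup:

#include <CL/cl.h>
#include <chrono>
#include <thread>

// Submit the work in chunks and pause briefly between them so the display
// driver gets a chance to service rendering before the TDR timeout.
for (size_t chunk = 0; chunk < num_chunks; ++chunk) {
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, &offsets[chunk],
                                        &chunk_size, nullptr, 0, nullptr, nullptr);
    if (err != CL_SUCCESS)
        break;
    clFinish(queue);                                            // wait for this chunk only
    std::this_thread::sleep_for(std::chrono::milliseconds(2));  // let the GPU breathe
}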
I'm writing a checkpoint that is checked on every pass through a loop. I think this wastes a lot of CPU time. How can I check against the system time only every 10 seconds?
time_t start = clock();
while(forever)
{
if(difftime(clock(),start)/CLOCKS_PER_SEC >=timeLimit)
{
break;
}
}
The very short answer is that this is very difficult, if you're a novice programmer.
Now a few possibilities:
Sleep for ten seconds. That means your program is basically pointless.
Use alarm() and signal handlers. This is difficult to get right, because you mustn't do anything fancy inside the signal handler.
Use a timerfd and integrate timing logic into your I/O loop.
Set up a dedicated thread for the timer (which can then sleep); this is exceedingly difficult because you need to think about synchronising all shared data access.
The point to take home here is that your problem doesn't have a simple solution. You need to integrate the timing logic deeply into your already existing program flow. This flow should be some sort of "main loop" (e.g. an I/O multiplexing loop like epoll_wait or select), possibly multi-threaded, and that loop should pick up the fact that the timer has fired.
It's not that easy.
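To make the timerfd option above concrete, here is a minimal sketch of a Linux main loop that wakes up every 10 seconds (error handling omitted):

#include <cstdint>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main() {
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    itimerspec its {};
    its.it_value.tv_sec = 10;      // first expiry after 10 seconds
    its.it_interval.tv_sec = 10;   // and every 10 seconds after that
    timerfd_settime(tfd, 0, &its, nullptr);

    int epfd = epoll_create1(0);
    epoll_event ev {};
    ev.events = EPOLLIN;
    ev.data.fd = tfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

    for (;;) {
        epoll_event got {};
        epoll_wait(epfd, &got, 1, -1);   // the process sleeps here, burning no CPU
        if (got.data.fd == tfd) {
            uint64_t expirations = 0;
            read(tfd, &expirations, sizeof expirations);  // drain the expiration counter
            // ten seconds have passed: do the periodic work here
        }
        // other fds (sockets, pipes, ...) would be added to the same epoll set
    }
}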
Here's a tangent, possibly instructive. There are basically two kinds of computer program (apart from all the other kinds):
One kind is programs that perform one specific task, as efficiently as possible, and are then done. This is for example something like "generate an SSL key pair", or "find all lines in a file that match X". Those are the sort of programs that are easy to write and understand as far as the program flow is concerned.
The other kind is programs that interact with the user. Those programs stay up indefinitely and respond to user input. (Basically any kind of UI or game, but also a web server.) From a control flow perspective, these programs spend the vast majority of their time doing... nothing. They're just idle waiting for user input. So when you think about how to program this, how do you make a program do nothing? This is the heart of the "main loop": It's a loop that tells the OS to keep the process asleep until something interesting happens, then processes the interesting event, and then goes back to sleep.
It isn't until you understand how to do nothing that you'll be able to design programs of the second kind.
If you need precision, you can place a call to select() with null parameters but with a delay. This is accurate to the millisecond.
struct timeval timeout= {10, 0};
select(1,NULL,NULL,NULL, &timeout);
If you don't, just use sleep():
sleep(10);
Just add a call to sleep to yield CPU time to the system:
time_t start = time(NULL); // wall-clock time; clock() would only count CPU time
while(forever)
{
    if(difftime(time(NULL), start) >= timeLimit)
    {
        break;
    }
    sleep(1); // <<< put the process to sleep for 1s instead of spinning
}
You can use an event loop in your program and schedule a timer to invoke a callback. For example, you can use libev to create an event loop and add a timer:
ev_timer_init (timer, callback, 0., 5.);
ev_timer_again (loop, timer);
...
timer->again = 17.;
ev_timer_again (loop, timer);
...
timer->again = 10.;
ev_timer_again (loop, timer);
If you code with a specific toolkit you can use its event loop instead: GTK, Qt and GLib all have their own event loops you can use.
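For completeness, a minimal self-contained libev program that fires a callback every 10 seconds might look like this (libev 4 API; the callback body is illustrative):

#include <ev.h>
#include <stdio.h>

static void timeout_cb(struct ev_loop *loop, ev_timer *w, int revents) {
    // Called every 10 seconds; do the periodic check here.
    printf("10 seconds passed\n");
}

int main() {
    struct ev_loop *loop = EV_DEFAULT;               // the default event loop
    ev_timer timer;
    ev_timer_init(&timer, timeout_cb, 10.0, 10.0);   // first after 10 s, then every 10 s
    ev_timer_start(loop, &timer);
    ev_run(loop, 0);                                 // sleeps between events, burning no CPU
    return 0;
}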
The simplest approach (in a single-threaded environment) would be to sleep for some time and repeatedly check whether the total waiting time has expired.
int sleepPeriodMs = 500;
time_t start = time(NULL);  // wall-clock time; clock() would only count CPU time
while(forever)
{
    while(difftime(time(NULL), start) < timeLimit) // NOTE: Change of logic here!
    {
        usleep(sleepPeriodMs * 1000); // usleep() takes microseconds
    }
    // the limit has expired: do the periodic work, then restart the interval
    start = time(NULL);
}
Please note that sleep() is not very accurate. If you need higher-accuracy timing (i.e. better than ~10ms of resolution) you might need to dig deeper. Also, with C++11 there is the <chrono> header, which offers a lot more functionality.
#include <chrono>
#include <thread>

using namespace std::chrono;

while(forever)
{
    auto start = system_clock::now();
    // do some stuff that takes between [0..10[ seconds
    std::this_thread::sleep_until(start + seconds(10));
}