I have two programs that are supposed to do the same thing with slight differences. Both have infinite game loops that runs forever unless user stops the game somehow. One of these programs' game loop is implemented and rendering something, the other game loop is empty and does nothing(just listens for user to stop).
When i opened the task manager to see resource usage, i have discovered that the program with the empty loop uses 14% CPU and the program that actually draws something to screen uses about 1-2%.
My guess on the subject is as follows:
I compared the code of the both programs and looked for differences and there was not much. Then it occurred to me that the loop that renders to screen might be bound by other factors(like sending pixels to the screen, refresh rate maybe?) So after CPU does its thing, it puts that thread to sleep until other stuff is completed. But since other program does pretty much nothing and doing nothing is really easy, CPU never puts that thread to sleep and just keeps going. I lack the knowledge to confirm that if this is the reason, so i am asking you. Is this the reason this is happening? (Bonus question) And if so, why does the CPU stop at about 14% and not going all the way up to 100% ?
Thank you.
Hard to say for certain without seeing the code, but drawing to the screen will, inevitably involve some wait on IO; how much depends on many factors including sync + buffering options.
As for the 14% cpu usage - I'm guessing that your machine has 8 processing units (either cores or cores * hyperthreading) and your code is singlethreaded - i.e. it is maxing out one processing unit.
Related
I am considering a programming project. Will run under Ubuntu or other Linux OS on a small board. Quad core x86 - N-Series Pentium. The software generates 8 fast signals; square wave pulse trains for stepper motor motion control of 4 axes. Step signals being 50-100 KHz maximum, but usually slower. Want to avoid jitter in these stepping signals (call it good fidelity), so that around 1-2us for each thread loop cycle would be a nice target. The program does other kinds of tasks, like read/write hard drive, Ethernet, continues update on the graphics display, keyboard. The existing Single core programs just can not process motion signals with this kind of timing and require external hardware/techniques to achieve this.
I have been reading other posts here, like on a thread running selected core, continuously. The exact meaning in these posts is "lose", not sure really what is meant. Continuous might mean testing every minute or ?????
So, I might be wordy, but it will clear I hope. The considered program has all the threads, routines, memory, shared memory all included. I am not considering that this program launches another program or service. Other threads are written in this program and launched when the program starts up. I will call this signal generating thread the FAST THREAD.
The FAST THREAD is to be launched to an otherwise "free" core. It needs to be the only thread that runs on the core. Hopefully, the OS thread scheduler component on that core can be "turned off", so that it does not even interrupt on that core to decide what thread runs next. In looking at the processor manual, Each core has a counter timer chip. Is it possible then that I can use it to provide a continuous train of interrupts then into my "locked in" FAST THREAD for timing purposes? This is the range of about 1-2 us. If not, then just reading one channel on that CTC to provide software sync. This fast thread will, therefore, see (experience) no delays from the interrupts issued in the other cores and associated multicore fabric. This FAST THREAD, when running, will continue to run until the program closes. This could be hours.
Input data to drive this FAST THREAD will be common shared memory defined in the program. There are also hardware signals for motion limits (From GPIOs or SDI port). If any go TRUE, that forces a programmed halt all motion. It does not need a 1~2us response. It could go to a slower Motion loop.
Ah, the output:
Some motion data is written back to the shared memory (assigned for this purpose). Like current location, and current loop number,
Some signals need to be output (the 8 outputs). There are numerous free GPIOs. Not sure of the path taken to get the signaled GPIO pin to change the output. The system call to Linux initiates the pin change event. There is also an SDI port available, running up to the 25Mhz clock. It seems these ports (GPIO, UART, USB, SDI) exist in the fabric that is not on any specific core. I am not sure of the latency from the issuance of these signals in the program until the associated external pin actually presents that signal. In the fast thread, even 10us would be OK, if it was always the same latency! I know that will not be so, there will jitter. I need to think on this spec.
There will possibly be a second dedicated core (similar to above) for slower motion planning. That leaves two cores for everything thing else. Since then everything else items (sata, video screen, keyboard ...) are already able to work in a single core, then the remaining two cores should be great.
At close of program, the FAST THREAD returns the CTC and any other device on its core back to "as it was", re-enables the OS components in this core to their more normal operation. End of thread.
Concluding: I have described the overall program, so as for you to understand what I want to do with this FAST THREAD running, how responsive it needs to be, and that it needs to be undisturbed!! This processor runs in the 1.5 ~ 2.0 GHz range. It certainly can do the repeated calculations in the required time frame.
DESIRED: I do not know the system calls that would allow me to use a selected x86 core in this way. Any pointers would be helpful. Any manual or document that described these calls/procedures.
Can this use of a core also be done in windows 7,10)?
Thanks for reading and any pointers you have.
Stan
Solved: For when simple profiling isn't effective enough, I have written a tool to show me where performance hits occur. Basic information about how the tool works is in the accepted answer below. The source can be found here: http://pastebin.com/ETiW8hE8 (be sure to turn debugging symbols on in the program you're testing)
I've built a game engine in C++ and I have noticed in one particular area of a level that there is a brief performance hit. The game will stop completely for about half a second, and then continue on merrily. I've tried to profile this, but it's difficult isolate the condition since I also have to load the map and perform the in-game task which causes the performance hit. I can make a map load automatically and skip showing menus, etc, and comparing those profile results against a set of similar control data (all the same steps but without actually initiating the performance hit), but it doesn't show anything obvious.
I'm using gmon to profile.
This is a large application with many, many classes and functions. The performance hit only happens once, so there's no way to just trigger the problem many times during one execution to saturate my profiling results in order to make the offending functions more obvious in the profiling results.
What else can I do?
What I would do is try to grab a stack sample in that half second when it's frozen.
This would require an alarm clock timer set to go off some small time in the future, like 100ms.
Then in some loop, like the frame display loop, that normally takes less than 100ms to repeat, keep resetting the timer.
That way, it will act as a watchdog that barks if you don't keep petting it.
Then, stick a breakpoint in the timer interrupt handler.
When it gets there, you know you're in the bad slice of time.
Then just display the call stack, and it should show you what the problem is.
You might have to repeat the process a few times.
You are not saying anything about whether your application is threaded, but I will assume that it is not.
As per suggestion from mike, get insights by getting a stack trace at and see where it is freezing, you can do that with a bit of luck using pstack, so
while usleep 100000; do
pstack processid
done >/tmp/stack.log
Should give you some output to go on -- my guess is that you are calling a blocking IO operation, like reading some assets from disk.
So I was reading this article which contains 'Tips and Advice for Multithreaded Programming in SDL' - https://vilimpoc.org/research/portmonitorg/sdl-tips-and-tricks.html
It talks about SDL_PollEvent being inefficient as it can cause excessive CPU usage and so recommends using SDL_WaitEvent instead.
It shows an example of both loops but I can't see how this would work with a game loop. Is it the case that SDL_WaitEvent should only be used by things which don't require constant updates ie if you had a game running you would perform game logic each frame.
The only things I can think it could be used for are programs like a paint program where there is only action required on user input.
Am I correct in thinking I should continue to use SDL_PollEvent for generic game programming?
If your game only updates/repaints on user input, then you could use SDL_WaitEvent. However, most games have animation/physics going on even when there is no user input. So I think SDL_PollEvent would be best for most games.
One case in which SDL_WaitEvent might be useful is if you have it in one thread and your animation/logic on another thread. That way even if SDL_WaitEvent waits for a long time, your game will continue painting/updating. (EDIT: This may not actually work. See Henrik's comment below)
As for SDL_PollEvent using 100% CPU as the article indicated, you could mitigate that by adding a sleep in your loop when you detect that your game is running more than the required frames-per-second.
If you don't need sub-frame precision in your input, and your game is constantly animating, then SDL_PollEvent is appropriate.
Sub-frame precision can be important for, eg. games where the player might want very small increments in movement - quickly tapping and releasing a key has unpredictable behavior if you use the classic lazy method of keydown to mean "velocity = 1" and keyup to mean "velocity = 0" and then you only update position once per frame. If your tap happens to overlap with the frame render then you get one frame-duration of movement, if it does not you get no movement, where what you really want is an amount of movement smaller than the length of a frame based on the timestamps at which the events occurred.
Unfortunately SDL's events don't include the actual event timestamps from the operating system, only the timestamp of the PumpEvents call, and WaitEvent effectively polls at 10ms intervals, so even with WaitEvent running in a separate thread, the most precision you'll get is 10ms (you could maybe approximate smaller by saying if you get a keydown and keyup in the same poll cycle then it's ~5ms).
So if you really want precision timing on your input, you might actually need to write your own version of SDL_WaitEventTimeout with a smaller SDL_Delay, and run that in a separate thread from your main game loop.
Further unfortunately, SDL_PumpEvents must be run on the thread that initialized the video subsystem (per https://wiki.libsdl.org/SDL_PumpEvents ), so the whole idea of running your input loop on another thread to get sub-frame timing is nixed by the SDL framework.
In conclusion, for SDL applications with animation there is no reason to use anything other than SDL_PollEvents. The best you can do for sub-framerate input precision is, if you have time to burn between frames, you have the option of being precise during that time, but then you'll get weird render-duration windows each frame where your input loses precision, so you end up with a different kind of inconsistency.
In general, you should use SDL_WaitEvent rather than SDL_PollEvent to release the CPU to the operating system to handle other tasks, like processing user input. This will manifest to you users as sluggish reaction to user input, since this can cause a delay between when they enter a command and when your application processes the event. By using SDL_WaitEvent instead, the OS can post events to your application more quickly, which improves the perceived performance.
As a side benefit, users on battery powered systems, like laptops and portable devices should see slightly less battery usage since the OS has the opportunity to reduce overall CPU usage since your game isn't using it 100% of the time - it would only be using it when an event actually occurs.
This is a very late response, I know. But this is the thread that tops a Google search on this, so it seems the place to add an alternative suggestion to dealing with this that some might find useful.
You could write your code using SDL_WaitEvent, so that, when your application is not actively animating anything, it'll block and hand the CPU back to the OS.
But then you can send a user-defined message to the queue, from another thread (e.g. the game logic thread), to wake up the main rendering thread with that message. And then it goes through the loop to render a frame, swap and returns back to SDL_WaitEvent again. Where another of these user-defined messages can be waiting to be picked up, to tell it to loop once more.
This sort of structure might be good for an application (or game) where there's a "burst" of animation, but otherwise it's best for it to block and go idle (and save battery on laptops).
For example, a GUI where it animates when you open or close or move windows or hover over buttons, but it's otherwise static content most of the time.
(Or, for a game, though it's animating all the time in-game, it might not need to do that for the pause screen or the game menus. So, you could send the "SDL_ANIMATEEVENT" user-defined message during gameplay, but then, in the game menus and pause screen, just wait for mouse / keyboard events and actually allow the CPU to idle and cool down.)
Indeed, you could have self-triggering animation events. In that the rendering thread is woken up by a "SDL_ANIMATEEVENT" and then one more frame of animation is done. But because the animation is not complete, the rendering thread itself posts a "SDL_ANIMATEEVENT" to its own queue, that'll trigger it to wake up again, when it reaches SDL_WaitEvent.
And another idea there is that SDL events can carry data too. So you could supply, say, an animation ID in "data1" and a "current frame" counter in "data2" with the event. So that when the thread picks up the "SDL_ANIMATEEVENT", the event itself tells it which animation to do and what frame we're currently on.
This is a "best of both worlds" solution, I feel. It can behave like SDL_WaitEvent or SDL_PollEvent at the application's discretion by just sending messages to itself.
For a game, this might not be worth it, as you're updating frames constantly, so there's no big advantage to this and maybe it's not worth bothering with (though even games could benefit from going to 0% CPU usage in the pause screen or in-game menus, to let the CPU cool down and use less laptop battery).
But for something like a GUI - which has more "burst-y" animation - then a mouse event can trigger an animation (e.g. opening a new window, which zooms or slides into view) that sends "SDL_ANIMATEEVENT" back to the queue. And it keeps doing that until the animation is complete, then falls back to normal SDL_WaitEvent behaviour again.
It's an idea that might fit what some people need, so I thought I'd float it here for general consumption.
You could actually initialise the SDL and the window in the main thread and then create 2 more threads for updates(Just updates game states and variables as time passes) and rendering(renders the surfaces accordingly).
Then after all that is done, use SDL_WaitEvent in your main thread to manage SDL_Events. This way you could ensure that event is managed in the same thread that called the sdl_init.
I have been using this method for long to make my games work in windows and linux and have been able to successfully run 3 threads at the same time as mentioned above.
I had to use mutex to make sure that textures/surfaces can be transformed/changed in the update thread as well by pausing the render thread, and the lock is called every once 60 frames, so its not going to cause major perf issues.
This model works best to create event driven games, run time games, or both.
Having a bit of an issue with a game I'm making using opengl. The game will sometimes run at half speed and sometimes it will run normally.
I don't think it is the opengl causing the problem since it runs at literally 14,000 fps on my computer. (even when its running at half speed)
This has led me to believe that is is the "game timer" thats causing the problem. The game timer runs on a seperate thread and is programmed to pause at the end of its "loop" with a Sleep(5) call. if i remove the Sleep(5) call, it runs so fast that i can barely see the sprites on the screen. (predictable behavior)
I tried throwing a Sleep(16) at the end of the Render() thread (also on its own thread). This action should limit the fps to around 62. Remember that the app runs sometimes at its intended speed and sometimes at half speed (i have tried on both of the computers that i own and it persists).
When it runs at its intended speed, the fps is 62 (good) and sometimes 31-ish (bad). it never switches between half speed and full speed mid execution, and the problem persists even after a reboot..
So its not the rendering that causing the slowness, its the Sleep() function
I guess what im saying is that the Sleep() function is inconsistent with the times that it actually sleeps. is this a proven thing? is there a better Sleep() function that i could use?
A waitable timer (CreateWaitableTimer and WaitForSingleObject or friends) is much better for periodic wakeup.
However, in your case you probably should just enable VSYNC.
See the following discussion of the Sleep function, focusing on the bit about scheduling priorities:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx
yes, Sleep function is inconsistency, it is very useful in the case of macro condition.
if you want to a consistency time,please use QueryPerformanceFrequency get the frequency of CPU, and QueryPerformanceCount twice for start and end, and then (end-start) / frequency get the consistency time, but you must look out that if your CPU is mulit cores, the start and end time maybe not the same CPU core, so please us SetThreadAffinity for you working thread set the same CPU core.
Had a same problem. For I just made my own sleep logic and worked for me.
#include <chrono>
using namespace std::chrono;
high_resolution_clock::time_point sleep_start_time = high_resolution_clock::now();
while (duration_cast<duration<double>>(high_resolution_clock::now() - sleep_start_time).count() < must_sleep_duration) {}
I would like to achieve determinism in my game engine, in order to be able to save and replay input sequences and to make networking easier.
My engine currently uses a variable timestep: every frame I calculate the time it took to update/draw the last one and pass it to my entities' update method. This makes 1000FPS games seem as fast ad 30FPS games, but introduces undeterministic behavior.
A solution could be fixing the game to 60FPS, but it would make input more delayed and wouldn't get the benefits of higher framerates.
So I've tried using a thread (which constantly calls update(1) then sleeps for 16ms) and draw as fast as possible in the game loop. It kind of works, but it crashes often and my games become unplayable.
Is there a way to implement threading in my game loop to achieve determinism without having to rewrite all games that depend on the engine?
You should separate game frames from graphical frames. The graphical frames should only display the graphics, nothing else. For the replay it won't matter how many graphical frames your computer was able to execute, be it 30 per second or 1000 per second, the replaying computer will likely replay it with a different graphical frame rate.
But you should indeed fix the gameframes. E.g. to 100 gameframes per second. In the gameframe the game logic is executed: stuff that is relevant for your game (and the replay).
Your gameloop should execute graphical frames whenever there is no game frame necessary, so if you fix your game to 100 gameframes per second that's 0.01 seconds per gameframe. If your computer only needed 0.001 to execute that logic in the gameframe, the other 0.009 seconds are left for repeating graphical frames.
This is a small but incomplete and not 100% accurate example:
uint16_t const GAME_FRAMERATE = 100;
uint16_t const SKIP_TICKS = 1000 / GAME_FRAMERATE;
uint16_t next_game_tick;
Timer sinceLoopStarted = Timer(); // Millisecond timer starting at 0
unsigned long next_game_tick = sinceLoopStarted.getMilliseconds();
while (gameIsRunning)
{
//! Game Frames
while (sinceLoopStarted.getMilliseconds() > next_game_tick)
{
executeGamelogic();
next_game_tick += SKIP_TICKS;
}
//! Graphical Frames
render();
}
The following link contains very good and complete information about creating an accurate gameloop:
http://www.koonsolo.com/news/dewitters-gameloop/
To be deterministic across a network, you need a single point of truth, commonly called "the server". There is a saying in the game community that goes "the client is in the hands of the enemy". That's true. You cannot trust anything that is calculated on the client for a fair game.
If for example your game gets easier if for some reasons your thread only updates 59 times a second instead of 60, people will find out. Maybe at the start they won't even be malicious. They just had their machines under full load at the time and your process didn't get to 60 times a second.
Once you have a server (maybe even in-process as a thread in single player) that does not care for graphics or update cycles and runs at it's own speed, it's deterministic enough to at least get the same results for all players. It might still not be 100% deterministic based on the fact that the computer is not real time. Even if you tell it to update every $frequence, it might not, due to other processes on the computer taking too much load.
The server and clients need to communicate, so the server needs to send a copy of it's state (for performance maybe a delta from the last copy) to each client. The client can draw this copy at the best speed available.
If your game is crashing with the thread, maybe it's an option to actually put "the server" out of process and communicate via network, this way you will find out pretty fast, which variables would have needed locks because if you just move them to another project, your client will no longer compile.
Separate game logic and graphics into different threads . The game logic thread should run at a constant speed (say, it updates 60 times per second, or even higher if your logic isn't too complicated, to achieve smoother game play ). Then, your graphics thread should always draw the latest info provided by the logic thread as fast as possible to achieve high framerates.
In order to prevent partial data from being drawn, you should probably use some sort of double buffering, where the logic thread writes to one buffer, and the graphics thread reads from the other. Then switch the buffers every time the logic thread has done one update.
This should make sure you're always using the computer's graphics hardware to its fullest. Of course, this does mean you're putting constraints on the minimum cpu speed.
I don't know if this will help but, if I remember correctly, Doom stored your input sequences and used them to generate the AI behaviour and some other things. A demo lump in Doom would be a series of numbers representing not the state of the game, but your input. From that input the game would be able to reconstruct what happened and, thus, achieve some kind of determinism ... Though I remember it going out of sync sometimes.