I'm interested in adding threading to the small engine I'm working on in my spare time, but I'm not sure what the best approach is. Specifically, I'm curious about the recommended way to sync the physics thread with the rest of the engine, similar to ThisGuy. I'm working with the Bullet Physics SDK, which already uses the data copy method he was describing, but I was wondering: once Bullet goes through one simulation step and syncs the data back to the other threads, won't that cause something like tearing without vertical sync, where the rendering thread, halfway through processing its data, suddenly starts using a newer and different set of information?
Is this something the viewer will actually be able to notice? What if, say, an explosion appears alongside the object it was meant to destroy?
If this is an issue, what then is the best way to solve it?
Lock the physics thread so it can't do anything until the rendering thread (and basically every other thread) has finished its frame? That seems like it would waste CPU time. Or is the preferable method to triple buffer: copy the physics data to a second location, continue the physics simulation, then copy that data over to the rendering thread once it's ready?
What approaches do you guys recommend?
The easiest and probably most common approach is to run the physics, render, AI, ... threads in parallel and synchronise them once each of them has finished its frame/timestep.
This is not the fastest solution, but it is the one with the fewest problems.
Writing data back to the rendering thread while it is still running leads to massive synchronisation problems (e.g. you have to lock every vector/matrix while updating it).
To make the parallelisation efficient, you have to minimise the amount of data to synchronise, e.g. only send the render thread data that can actually be rendered.
If you don't synchronise after each frame, you can end up with the physics/AI eating all the CPU power to produce 60 updates per second while the renderer only manages 10 fps, which in most cases is not what you want.
Double buffering would also increase performance, but you still need to synchronise your threads. One remaining problem is AI and physics (or similar threads), because they may want to modify the same data.
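To make the double-buffering idea concrete, here is a minimal sketch of that hand-off (the names are made up for illustration, and the snapshot holds only the data the renderer actually needs). The physics thread publishes a finished snapshot, and the render thread swaps it in at the start of its frame, so the data it reads can never change mid-frame:

#include <mutex>
#include <vector>

// Hypothetical per-object data the renderer needs; keep it as small as possible.
struct TransformSnapshot {
    float position[3];
    float rotation[4]; // quaternion
};

class PhysicsSnapshotBuffer {
public:
    // Called by the physics thread once its timestep has finished.
    void publish(std::vector<TransformSnapshot>&& snapshot) {
        std::lock_guard<std::mutex> lock(mutex_);
        back_ = std::move(snapshot);
        hasNew_ = true;
    }

    // Called by the render thread at the start of its frame; it keeps using
    // the returned buffer for the whole frame.
    const std::vector<TransformSnapshot>& acquireForFrame() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (hasNew_) {
            front_.swap(back_);
            hasNew_ = false;
        }
        return front_;
    }

private:
    std::mutex mutex_;
    std::vector<TransformSnapshot> front_; // read by the renderer
    std::vector<TransformSnapshot> back_;  // written by physics
    bool hasNew_ = false;
};

The only contended work is a buffer swap under the lock, so neither thread stalls the other for long.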
I have run into an issue where my application needs to load images dynamically at runtime. Loading them all up front isn't an option, because I can't know in advance which ones will be used, so I would end up uploading everything. The problem is that some users don't have good PCs and have been complaining that uploading all the images to the GPU takes a long time on their hardware.
My workaround for that group was to upload textures only as they are needed, and this worked for the most part. The problem is that at certain points in the application a series of images has to be uploaded, and the upload causes a noticeable delay.
While researching how to get around this, I had an idea: users want a smooth experience and are okay with textures not appearing immediately, being absent for a moment instead. That makes things easy: I can upload in the background, draw nothing where the object should be, and bring it into existence once the upload is done. This is acceptable because the uploads are usually pretty fast anyway, but they're slow enough to dip under 60 fps for some people, which causes stutter. On average it's anywhere from 1-3 frames of stutter, so the uploads do resolve quickly, typically in under 50 ms.
My solution was to attempt something with a PBO to get async-like uploading. The problem is that I cannot find anywhere online how to tell when the upload is done. Is there a way to do this?
I figure there are four options:
There's a way to do what I want with OpenGL 3.1 onwards and that will be that
It is not possible to do (1), but I could place a fence and then check whether the fence has completed; however, I've never had to do this before, so I'm not sure whether it would work in this case
It's not possible, but I could assume that everything uploads in < 50 ms, use some kind of timestamp to decide whether a texture is drawable, and just hope that holds (and if less than 50 ms have passed since issuing the upload, draw nothing)
It's not possible to do this for texture uploads and I'm stuck
This leads me to my question: Can I tell when an asynchronous upload of pixels to a texture is done?
Fence sync objects tell you when all previously issued commands have completed their execution. This includes asynchronous pixel transfer operations, so you can issue a fence after your transfers and use the sync object API to check when it is done.
The annoying issue you'll have here is that it's very coarse-grained. Testing the fence also tests whether any non-transfer commands issued before it have completed, even though the two kinds of operation are probably handled by independent hardware. So if the transfer completes before the frame that was being rendered when you started the transfer has finished rendering, the fence still won't be signalled. However, if you fire off a lot of texture uploads at once, the transfer operations will dominate the result.
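As a rough sketch of what that looks like in code (sync objects need GL 3.2 or the ARB_sync extension, and the helper names here are just placeholders):

#include <GL/glew.h> // or whichever loader you use

// Call right after the glTexSubImage2D(...) that sources from the bound PBO.
GLsync beginAsyncUpload() {
    return glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Poll once per frame; returns true when the transfer (and everything issued
// before it) has completed, i.e. the texture is safe to draw with.
bool uploadFinished(GLsync fence) {
    GLenum status = glClientWaitSync(fence, 0, 0); // timeout of 0 = non-blocking poll
    if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
        glDeleteSync(fence);
        return true;
    }
    return false;
}

Until uploadFinished() returns true you would simply keep drawing nothing for that object, exactly as you planned.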
I'm wondering how graphics (/game) engines do their job with lots of heterogeneous data, while a customized, simple rendering loop turns into a nightmare as soon as you have some small changes.
Example:
First, let's say we have some blocks in our scene.
Graphics engine: create cubes and move them.
Customized: create a cube template (vertices, normals, etc.), copy it and translate each copy into position, and put everything into e.g. a VBO. One glDraw* call does the job.
Second, some weird logic: we want blocks 1, 4, 7, ... to rotate on the x-axis, 2, 5, 8, ... on the y-axis and 3, 6, 9, ... on the z-axis, with a rotation speed proportional to the camera distance.
Graphics engine: manipulate each object's matrix and it works.
Customized: (I think) a glDraw* call per object with a changing model-matrix uniform is not a good idea, so the per-object transform should be something like an attribute? Then I have to update them every frame.
Third, a block should disappear if its distance to the camera is lower than some constant Q.
Graphics engine: if (object.distance(camera) < Q) scene.drop(object);
Customized: (I think) our VBO is now invalid and we have to recreate it?
Back to the very first sentence: it feels like engines do these manipulations for free, while we have to rethink how to provide and update the data. And while we do so, the engine (might, but I actually don't know) just says: 'update whatever you want, I'm going to send all the matrices anyway'.
Another example: what about a voxel-based world (e.g. Minecraft) where we only draw the visible surface and we are able to throw a bomb and destroy many voxels? If the world's view data sits in one huge buffer we only have one glDraw* call, but we have to recreate the buffer every time. If there are smaller chunks, we have many glDraw* calls, but the buffers we have to manipulate are smaller.
So is it a good deal to send, say, 10 MB of buffer update data instead of two gl* calls with 1 MB? How many updates are okay? Should a rendering loop deal with lazy updates?
I'm looking for a guide to what a 60 fps application should be able to update/draw per frame, to get a feeling for what is possible. In my own tests, every attempted optimization just becomes another bottleneck.
And I don't want the tutorials that say: hey, there is a new cool gl*Instance call which is super fast, buuuuut you have to check whether your GPU supports it. I'd rather consider that an optimization than a meaningful implementation to start with.
Do you have any ideas, sources, best practices or rules of thumb for how a rendering/updating routine plays together best?
My questions are all nearly the same:
How many updates per frame are okay on today's hardware?
Can I lazy-load data so that it becomes available after a few frames, without freezing my application?
Do I have to do small updates and profile my loop to see whether there are a few microseconds left before the next render?
Maybe I should implement a real-time profiler that builds up a feeling over time for how expensive updates are and can determine the number of updates per frame?
Thank you.
It's unclear how any of your questions relate to your "graphics engine" vs. "customized" examples: all the updates you do with a graphics engine are translated into those same OpenGL calls in the end.
In brief:
How many updates per frame are okay on today's hardware?
Today's PCIe bandwidth is huge (it can go as high as 30 GB/s). However, to use it fully you have to reduce the number of transactions by consolidating OpenGL calls. The exact number of updates you can afford depends entirely on the hardware, the drivers, and the way you use them, and graphics hardware is diverse.
This is the kind of answer you didn't want to hear, but unfortunately you have to face the truth: to reduce the number of OpenGL calls you have to use the newer APIs. E.g. instead of setting each uniform individually, you are better off submitting a bunch of them through uniform buffer objects. Instead of submitting each model's MVP individually, it's better to use instanced rendering. And so on.
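As a sketch of what that batching looks like (assuming GLEW and GLM just for brevity; the binding point and array size are placeholders that have to match your shader's uniform block, e.g. a PerObject block containing mat4 model[256]):

#include <GL/glew.h>
#include <glm/glm.hpp>
#include <vector>

// Upload all per-object matrices in one call and draw the whole batch with a
// single instanced draw, instead of one glUniform* + glDraw* pair per object.
// The vertex shader indexes the matrix array with gl_InstanceID.
void drawBatch(GLuint ubo, GLuint vao, GLsizei indexCount,
               const std::vector<glm::mat4>& models) {
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0,
                    models.size() * sizeof(glm::mat4), models.data());
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo); // binding point 0 must match the shader

    glBindVertexArray(vao);
    glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
                            nullptr, static_cast<GLsizei>(models.size()));
}

Two GL calls for the data plus one draw call now cover the whole batch, which is exactly the kind of consolidation that keeps the bus busy instead of the driver.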
An even more radical approach would be to move to a lower-level (and newer) API, i.e. Vulkan, which aims to solve exactly this problem: the cost of submitting work to the GPU.
Can I lazy-load data to have it after a few frames, but without freezing my application
Yes, you can upload buffer objects asynchronously. See Buffer Object Streaming for details.
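For example, one common streaming pattern from that page is buffer "orphaning", which lets you refill a buffer each frame without waiting for the GPU to finish reading the old contents (a sketch only; the function name is made up):

#include <GL/glew.h>

// Re-specify the buffer's data store with NULL so the old contents are
// detached ("orphaned"); the driver can keep reading them for in-flight
// frames while handing you fresh storage to fill without a stall.
void streamVertexData(GLuint vbo, const void* data, GLsizeiptr bytes) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_STREAM_DRAW); // orphan the old store
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);              // fill the new store
}

Spread large uploads over several frames this way and the data simply becomes usable a few frames later, without freezing the render loop.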
Do I have to do small updates and profile my loop if there are some microseconds left till next rendering?
Maybe I should implement a real-time profiler which gets a feeling over time, how expensive updates are and can determine the amount of updates per frame?
You don't need any of these if you do it asynchronously.
I have a DirectX game which spawns 2 boost threads on a dual-core system: one for gameplay/rendering (normally split into their own threads on a quad-core CPU), and one other thread which procedurally generates the game world. I believe that my audio middleware also spawns its own threads for playing SFX and music.
The game always runs at 100% CPU, which in turn can cause some sputtering from the audio system. I'm hoping I can reduce the CPU load by better managing the activity of that generation thread. While sometimes I need it running at full speed, at other times (when the player isn't moving much) it just keeps updating constantly without really doing a whole lot.
Is it possible / advisable to manually manage how active a thread is? If so, what strategies can I use to do that? I keep seeing people say that sleep() functions aren't really recommended, but I don't know what else to do yet.
Alternatively, am I barking up the wrong tree by trying to squeeze cycles out of thread management, and would I be better served by traditional profiling/optimization?
Running at 100% processor utilization means that you don't have a game clock: you are probably rendering frames as fast as the machine allows. It's actually pretty hard to hit exactly 100% with multiple threads, which also suggests that you aren't synchronizing the threads.
This is likely to require a fairly drastic rewrite. The pace ought to be set by the main render loop, the one that copies the back buffer to the video adapter; it sets your target FPS (frames per second). Quite often you use the monitor's vertical blanking interval for that: it solves tearing problems by ensuring that the monitor gets updated at exactly the right time, and it automatically paces the render loop to the monitor refresh rate, typically 60 times per second on LCD monitors. A timer is an alternative. This prevents the main thread from burning 100% of a core, assuming it can keep up with the FPS.
You now have a steady game clock tick: discrete moments in time at which things need to happen and jobs need to be completed in order to update the game state, like checking for player input. Inside the render loop, check for mouse/keyboard/controller input and use whatever you get to update the game world objects.
That tick in turn determines what the worker threads need to do. They have the duration of one pass through the render loop to finish their job. Use a synchronization object to wake them up, and another one per worker to let them signal that they are done with the current game tick. That stops them from burning core: when idle they should be waiting on the signal to start working on the next frame. Note that there is a balancing requirement: if a worker thread needs more than one game tick to finish, the render loop falls behind and misses a video adapter frame update, and your video starts to stutter. In general this is impossible to eliminate completely, but do make sure it doesn't affect the absolute game clock.
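As a sketch of that wake-up/completion handshake (using std::condition_variable for brevity; a boost or Win32 event would work the same way, and all names here are placeholders):

#include <condition_variable>
#include <mutex>

// One pair of flags per worker: "start this tick's work" and "work is done".
struct TickSync {
    std::mutex m;
    std::condition_variable cv;
    bool startRequested = false;
    bool workDone = true;
};

void workerLoop(TickSync& sync) {
    for (;;) {
        {   // Sleep (no CPU burned) until the render loop kicks off the next tick.
            std::unique_lock<std::mutex> lock(sync.m);
            sync.cv.wait(lock, [&] { return sync.startRequested; });
            sync.startRequested = false;
        }
        // ... do this tick's work (world generation, AI, ...) ...
        {
            std::lock_guard<std::mutex> lock(sync.m);
            sync.workDone = true;
        }
        sync.cv.notify_all();
    }
}

// Render loop side, once per frame: kick the worker, render, then wait for it.
void renderLoopTick(TickSync& sync) {
    {
        std::lock_guard<std::mutex> lock(sync.m);
        sync.workDone = false;
        sync.startRequested = true;
    }
    sync.cv.notify_all();
    // ... render the previous tick's results here ...
    std::unique_lock<std::mutex> lock(sync.m);
    sync.cv.wait(lock, [&] { return sync.workDone; });
}

The worker spends its idle time blocked in cv.wait(), so it no longer contributes to the 100% CPU load.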
Audio should be the easier problem to solve; you just need to keep the sound card buffers filled with enough data to survive a couple of frames' worth of sound.
Falling behind the target FPS is very easy to detect, and you can compensate automatically by lowering the target FPS. That way the program still runs acceptably on a slow machine, just not as smoothly. The net effect is that you'll stop burning 100% core on all threads.
I don't know anything about your system, so this answer may be way off. I assume there is some kind of audio output buffer whose size you can track. When the buffer gets so small that there is a danger the audio may stop, you should do something to refill it.
That "something" may be as simple as temporarily raising the priority of the audio thread. Come to think of it, why not set it higher from the start? That would solve all the problems, right? Even better, just lower the priority of the world generator thread.
When I worked on a voice-comm app for gamers many years ago, we hit this problem a lot. Many games are written to use every ounce of CPU. As a result, some games would starve our app (which ran in the background), causing audio drops and lost network connections. Many of those games would also call SetThreadPriority and SetPriorityClass with the REALTIME flags, basically consuming all the CPU quanta with no regard for anything else running on the system.
The typical fix we asked of game developers we partnered with was to simply insert a "Sleep(0)" call between each frame of their main game loop so that our threads wouldn't get stalled. I think we later added a switch in a software update to make our process run at a higher priority mode. Since then, Windows has gotten better about multitasking and thread priority with respect to these issues.
I'm making a game and I'm currently working on the generation of the map.
The map is generated procedurally with some algorithms; there are no problems with that part.
The problem is that the map can be huge, so I've thought about cutting it into chunks.
The chunks are okay, they're 512*512 pixels each, but there is one problem: I have to generate a texture (actually a RenderTexture from SFML) for each chunk. It takes around 0.5ms to generate, which makes the game freeze each time I generate a chunk.
I've thought of a way to fix this: I made a kind of thread pool with a factory. I just send it a task and it creates the chunk.
Now that it's all implemented, it raises OpenGL warnings like:
"An internal OpenGL call failed in RenderTarget.cpp (219) : GL_INVALID_OPERATION, the specified operation is not allowed in the current state".
I don't know if this is the right way of dealing with chunks. I've also thought about saving the chunks to images/files, but I'm afraid it would take too long to save and load them.
Do you know a better way to deal with this kind of "infinite" map?
It is an invalid operation because you must have a context bound to each thread that makes GL calls. More importantly, all of the GL window-system APIs enforce a strict 1:1 mapping between threads and contexts: no thread may have more than one context bound, and no context may be bound to more than one thread. What you would need to do is use shared contexts (one context for drawing and one for each worker thread); things like buffer objects and textures are shared between all shared contexts, but the state machine and container objects such as FBOs and VAOs are not.
Are you using tiled rendering for this map, or is this just one giant texture?
If you do not need to update individual sub-regions of your "chunk" images, you can simply create new textures in your worker threads: they can create the textures and fill them with data while the drawing thread goes about its business, and only after a worker thread finishes would you actually try to draw using one of the chunks. This may increase the overall latency between the time a chunk starts loading and when it eventually appears in the finished scene, but you should get a more consistent framerate.
If you need to use a single texture for this, I would suggest you double buffer it: have one texture that you use in the drawing thread and another that your worker threads issue glTexSubImage2D (...) on. When the worker thread(s) finish updating their regions, swap which texture you use for drawing and which you update. This reduces the amount of synchronization required, but again increases the latency before an update eventually appears on screen.
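A minimal sketch of that double-buffered texture swap (assuming the worker has already made sure its GL commands finished, e.g. via glFinish or a fence in its shared context, before flagging the back texture as ready; all names here are placeholders):

#include <GL/glew.h>
#include <atomic>
#include <utility>

// The drawing thread always samples 'front'; the worker updates 'back'
// through its own shared context and then sets backReady.
struct DoubleBufferedTexture {
    GLuint front = 0;
    GLuint back = 0;
    std::atomic<bool> backReady{false};
};

// Called on the drawing thread at a safe point, e.g. the start of a frame.
void swapIfReady(DoubleBufferedTexture& tex) {
    if (tex.backReady.exchange(false)) {
        std::swap(tex.front, tex.back); // just exchange the two handles
    }
}

The per-frame cost on the drawing thread is only a handle exchange; the latency comes from waiting for the worker's upload to complete.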
Things to try:
make your chunks smaller
generate the chunks in a separate thread, but pass the data to the GPU from the main thread
pass the data to the GPU a small piece at a time, spread over a second or two (a sketch combining the last two points follows below)
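For illustration, a rough sketch of those last two suggestions with SFML, assuming RGBA pixel data and made-up names; the worker only fills a plain pixel array (no GL calls), and the main thread uploads a slice of rows per frame so no single frame pays for the whole 512x512 transfer:

#include <SFML/Graphics.hpp>
#include <algorithm>
#include <vector>

struct PendingChunk {
    std::vector<sf::Uint8> pixels;          // RGBA, filled by the worker thread
    unsigned int width = 512, height = 512;
    unsigned int rowsUploaded = 0;
};

// Called on the main thread once per frame; 'target' must already be created
// at chunk.width x chunk.height. Returns true once the chunk is complete.
bool uploadSlice(sf::Texture& target, PendingChunk& chunk, unsigned int rowsPerFrame = 64) {
    unsigned int rows = std::min(rowsPerFrame, chunk.height - chunk.rowsUploaded);
    target.update(chunk.pixels.data() + chunk.rowsUploaded * chunk.width * 4,
                  chunk.width, rows, 0, chunk.rowsUploaded);
    chunk.rowsUploaded += rows;
    return chunk.rowsUploaded == chunk.height;
}

Tune rowsPerFrame so the per-frame upload cost stays comfortably inside your frame budget.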
I am programming a game using Visual C++ 2008 Express and the Ogre3D sdk.
My core gameplay logic is designed to run at 100 times/second. For simplicity, I'll say it's a method called 'gamelogic()'. It is not time-based, which means if I want to "advance" game time by 1 second, I have to call 'gamelogic()' 100 times. 'gamelogic()' is lightweight in comparison to the game's screen rendering.
Ogre has a "listener" mechanism that informs your code when it's about to draw a frame and when it has finished drawing one. If I just call 'gamelogic()' right before the frame is rendered, the gameplay will be greatly affected by the screen rendering speed, which could vary from 5 fps to 120 fps.
The easy solution that comes to mind is: calculate the time elapsed since the last rendered frame and call 'gamelogic()' that many times before the next frame: 100 * timeElapsedInSeconds.
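To be clear, what I mean is roughly this (just a sketch with an accumulator so fractional steps carry over; gamelogic() is my function from above):

#include <chrono>

void gamelogic(); // my fixed-step update, meant to run at 100 Hz

void runFrame() {
    using clock = std::chrono::steady_clock;
    static clock::time_point last = clock::now();
    static double accumulator = 0.0;
    const double step = 1.0 / 100.0;

    clock::time_point now = clock::now();
    accumulator += std::chrono::duration<double>(now - last).count();
    last = now;

    while (accumulator >= step) {
        gamelogic();          // advance the simulation by exactly one tick
        accumulator -= step;
    }
    // the frame is rendered after this
}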
However, I presume that the "right" way to do it is with multithreading: have a separate thread that runs 'gamelogic()' 100 times/sec.
The question is, how do I achieve this, and what can be done when there is a conflict between the two separate threads: gamelogic changing screen content (3D object coordinates) while Ogre is rendering the screen at the same time?
Many thanks in advance.
If this is your first game application, using multithreading to achieve your results might be more work than you should really tackle in your first game. Synchronizing a game loop and a render loop in different threads is not an easy problem to solve.
As you correctly point out, rendering time can greatly affect the "speed" of your game. I would suggest that you do not make your game logic dependent on a fixed time slice (i.e. 1/100 of a second). Make it dependent on the current frame time (well, the last frame time, since you don't know how long the current frame will take to render).
Typically I would write something like the following (greatly simplified):
float Frametime = 1.0f / 30.0f;            // reasonable initial guess
while (1) {
    game_loop(Frametime);                  // manipulate objects, etc.
    render_loop();                         // render the frame
    Frametime = calculate_new_frametime(); // measure how long this frame took
}
Where Frametime is the calculated time that the current frame took. When you process your game loop you are using the frametime from the previous frame (so set the initial value to something reasonable, like 1/30th or 1/15th of a second). Running on the previous frametime is close enough to get you the results that you need. Run your game loop using that time, then render your stuff. You might have to change the logic in your game loop to not assume a fixed time interval, but generally those kinds of fixes are pretty easy.
Asynchronous game/render loops may be something that you ultimately need, but that is a tough problem to solve. It involves taking snapshots of objects and their relevant data, putting those snapshots into a buffer and then passing the buffer to the rendering engine. That memory buffer has to be correctly partitioned with critical sections to avoid having the game loop write to it while the render loop is reading from it. You'll have to take care to copy all relevant data into the buffer before passing it to the render loop. Additionally, you'll have to write logic to stall either the game or the render loop while waiting for the other to complete.
This complexity is why I suggest writing it in a more serial manner first (unless you already have the experience). Doing it the "easy" way first will force you to learn how your code works, how the rendering engine works, what kind of data the rendering engine needs, and so on. Multithreading knowledge is definitely required in complex game development these days, but doing it well requires in-depth knowledge of how game systems interact with each other.
There's not a whole lot of benefit to your core game logic running faster than the player can respond. About the only time it's really useful is for physics simulations, where running at a fast, fixed time step can make the sim behave more consistently.
Apart from that, just update your game loop once per frame, and pass in a variable time delta instead of relying on the fixed one. The benefit you'll get from doing multithreading is minimal compared to the cost, especially if this is your first game.
Double buffering your renderable objects is an approach you could explore: the rendering component uses one buffer, which is swapped in once all game actions have updated the relevant objects in the second buffer.
But personally I don't like it; I'd employ Mark's approach instead (and have, frequently).