Infinite cube world engine (like Minecraft) optimization suggestions?

As a fun project (and to get my Minecraft-addicted son excited about programming) I am building a 3D Minecraft-like voxel engine using C# .NET 4.5.1, OpenGL and GLSL 4.x.
Right now my world is built using chunks. Chunks are stored in a dictionary, where I can select them based on a 64-bit X | Z<<32 key. This allows me to create an 'infinite' world that can cache chunks in and out.
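A minimal sketch of that keying scheme (C++ for illustration; the poster uses a C# Dictionary, and 'Chunk' is a stand-in for the engine's own chunk type):

#include <cstdint>
#include <unordered_map>

struct Chunk; // per-chunk data: block segments, VBO handle, etc.

uint64_t ChunkKey(int32_t x, int32_t z)
{
    // Cast through uint32_t first so negative coordinates don't
    // sign-extend into the other half of the key.
    return (uint64_t)(uint32_t)x | ((uint64_t)(uint32_t)z << 32);
}

std::unordered_map<uint64_t, Chunk*> chunks;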
Every chunk consists of an array of 16x16x16 block segments. Starting from level 0 (bedrock), it can go as high as you want (unlike Minecraft, where the limit is 256, I think).
Chunks are queued for generation on a separate thread when they come into view and need to be rendered. This means that chunks might not show up right away; in practice you will not notice this. NOTE: I am not waiting for them to be generated, they will just not be visible immediately.
When a chunk needs to be rendered for the first time, a VBO (glGenBuffers, GL_STREAM_DRAW, etc.) is generated for that chunk containing the possibly visible/outside faces (neighboring chunks are checked as well). This means that a chunk potentially needs to be re-tessellated when a neighbor has been modified. When tessellating, the opaque faces are tessellated first for every segment, then the transparent ones. Every segment knows where it starts within that vertex array and how many vertices it has, both for opaque faces and transparent faces.
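A minimal sketch of that per-chunk upload in raw GL (an illustration, not the poster's code; 'Vertex' is a hypothetical layout, and a loader such as GLEW is assumed):

#include <GL/glew.h>
#include <vector>

struct Vertex { float pos[3]; float normal[3]; float uvLayer[3]; };

GLuint UploadChunkMesh(const std::vector<Vertex>& verts)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // GL_STREAM_DRAW: the mesh is rewritten whenever the chunk (or a
    // neighbor) changes and the chunk has to be re-tessellated.
    glBufferData(GL_ARRAY_BUFFER,
                 verts.size() * sizeof(Vertex), verts.data(),
                 GL_STREAM_DRAW);
    return vbo;
}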
Textures are taken from an array texture.
When rendering:
I first take the bounding box of the frustum and map that onto the chunk grid. Using that knowledge I pick every chunk that is within the frustum and within a certain distance of the camera.
Now I do a distance sort on the chunks.
After that I determine the ranges (index, length) of the chunk segments that are actually visible. Now I know exactly which segments (and which vertex ranges) are at least partially in view. The only excess segments I have are the ones hidden behind mountains or, sometimes, deep underground.
Then I start rendering: first the opaque faces [culling and depth test enabled, alpha test and blend disabled], front to back, using the known vertex ranges. Then I render the transparent faces back to front [blend enabled].
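A sketch of that two-pass state setup in raw GL calls (per-chunk VAO/VBO binding is omitted, and disabling depth writes during the transparent pass is my assumption, not something stated above):

#include <vector>

struct Range { GLint first; GLsizei count; };

void DrawWorld(const std::vector<Range>& opaque,       // sorted near to far
               const std::vector<Range>& transparent)  // sorted far to near
{
    // Pass 1: opaque faces, front to back.
    glEnable(GL_CULL_FACE);
    glEnable(GL_DEPTH_TEST);
    glDisable(GL_BLEND);
    for (const Range& r : opaque)
        glDrawArrays(GL_TRIANGLES, r.first, r.count);

    // Pass 2: transparent faces, back to front.
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDepthMask(GL_FALSE); // assumption: test against opaque depth, don't write
    for (const Range& r : transparent)
        glDrawArrays(GL_TRIANGLES, r.first, r.count);
    glDepthMask(GL_TRUE);
}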
Now... does anyone know a way of improving this while still allowing dynamic generation of an infinite world? I am currently reaching ~80 fps @ 1920x1080 and ~120 fps @ 1024x768 (screenshots: http://i.stack.imgur.com/t4k30.jpg, http://i.stack.imgur.com/prV8X.jpg) on an average 2.2 GHz i7 laptop with an ATI HD8600M graphics card. I think it must be possible to increase the frame rate, and I think I have to, as I want to add entity AI, sound, and bump and specular mapping. Could occlusion queries help me out? I can't really imagine how, given the nature of the segments. I have already minimized object creation, so there is no 'new Object' all over the place. Also, since performance doesn't really change between Debug and Release mode, I don't think it's the code but rather the approach to the problem.
edit: I have been thinking of using GL_SAMPLE_ALPHA_TO_COVERAGE, but it doesn't seem to be working:
gl.Enable(GL.DEPTH_TEST);
gl.Enable(GL.BLEND); // gl.Disable(GL.BLEND); -- alpha-to-coverage normally replaces blending, so try it disabled
gl.Enable(GL.MULTI_SAMPLE);
gl.Enable(GL.SAMPLE_ALPHA_TO_COVERAGE); // only takes effect on a multisampled framebuffer

To render a lot of similar objects, I strongly suggest you take a look at instanced drawing: glDrawArraysInstanced and/or glDrawElementsInstanced.
It made a huge difference for me; I'm talking about going from 2 fps to over 60 fps rendering 100,000 similar icosahedrons.
You can parameterize your cubes using attribs (glVertexAttribDivisor and friends) to make them different. Hope this helps.
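For illustration, a hedged sketch of what that looks like in raw GL, assuming a VAO whose attribute 0 already holds a 36-vertex cube mesh:

#include <vector>

std::vector<float> offsets; // filled with 3 floats (x, y, z) per instance
GLsizei instanceCount = (GLsizei)(offsets.size() / 3);

GLuint instanceVbo;
glGenBuffers(1, &instanceVbo);
glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
glBufferData(GL_ARRAY_BUFFER, offsets.size() * sizeof(float),
             offsets.data(), GL_STATIC_DRAW);

glEnableVertexAttribArray(1);                 // per-instance position
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
glVertexAttribDivisor(1, 1);   // advance once per instance, not per vertex

// One call draws every cube; the vertex shader adds the offset, e.g.
// gl_Position = viewProj * vec4(pos + instanceOffset, 1.0);
glDrawArraysInstanced(GL_TRIANGLES, 0, 36, instanceCount);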

It's at ~200 fps currently, which should be OK. The three main things I've done are:
1) generating the chunks on a separate thread;
2) tessellating the chunks on a separate thread;
3) using a deferred rendering pipeline.
I don't really think the last one contributed much to the overall performance, but I had to start using it because of some of the shaders. Now the CPU is more or less falling asleep at ~11%.

This question is pretty old, but I'm working on a similar project. I approached it almost exactly the same way as you; however, I added one additional optimization that helped a lot.
For each chunk, I determine which sides are completely opaque. I then use that information to flood fill through the chunks and cull out the ones that are sealed underground. Note that I'm not checking individual blocks during the flood fill, only a precomputed bitmask for each chunk.
When I'm computing the bitmask, I also check whether the chunk is entirely empty, since empty chunks can obviously be ignored.
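An illustrative sketch of that flood fill (not the answerer's actual code; ChunkGrid, ChunkCoord and Neighbor are hypothetical stand-ins for the engine's own types):

#include <cstdint>
#include <queue>

struct ChunkInfo {
    uint8_t solidFaces; // bit i set => chunk face i (+X,-X,+Y,-Y,+Z,-Z) is fully opaque
    bool    visited;    // reachable from the camera's chunk
};

static const int kOpposite[6] = { 1, 0, 3, 2, 5, 4 }; // matching face on the neighbor

// Mark every chunk the camera could possibly see; unmarked chunks are
// sealed underground and can be skipped entirely when rendering.
// (Entirely empty chunks simply have solidFaces == 0.)
void FloodFill(ChunkGrid& grid, ChunkCoord start) // hypothetical grid/coord types
{
    std::queue<ChunkCoord> open;
    grid.at(start).visited = true;
    open.push(start);
    while (!open.empty()) {
        ChunkCoord c = open.front(); open.pop();
        ChunkInfo& info = grid.at(c);
        for (int face = 0; face < 6; ++face) {
            if (info.solidFaces & (1 << face)) continue;   // can't leave through this face
            ChunkCoord n = Neighbor(c, face);              // hypothetical helper
            if (!grid.contains(n)) continue;
            ChunkInfo& ni = grid.at(n);
            if (ni.visited) continue;
            if (ni.solidFaces & (1 << kOpposite[face])) continue; // can't enter either
            ni.visited = true;
            open.push(n);
        }
    }
}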

Related

OpenGL rendering/updating loop issues

I'm wondering how e.g. graphics (game) engines do their job with lots of heterogeneous data, while a customized, simple rendering loop turns into a nightmare when you make some small changes.
Example:
First, let's say we have some blocks in our scene.
Graphics engine: create cubes and move them.
Customized: create a cube template (vertices, normals, etc.), copy and translate it to each block's position, and copy everything into e.g. a VBO. One glDraw* call does the job.
Second, some weird logic: we want blocks 1, 4, 7, ... to rotate on the x-axis, 2, 5, 8, ... on the y-axis and 3, 6, 9, ... on the z-axis, with a rotation speed linear in the camera distance.
Graphics engine: manipulate each object's matrix and it works.
Customized: (I think) a glDraw* call per object with a changing model-matrix uniform is not a good idea, so the transformation matrix should be something like an attribute (see the sketch after these examples)? I would have to update them every frame.
Third, a block should disappear if its distance to the camera drops below some constant Q.
Graphics engine: if (object.distance(camera) < Q) scene.drop(object);
Customized: (I think) our VBO is now invalid and we have to recreate it?
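Regarding the matrix-as-attribute idea above, a hedged sketch (C++/raw GL for illustration): a mat4 attribute occupies four consecutive attribute slots, each with divisor 1, so updating all object matrices becomes one buffer upload per frame.

glBindBuffer(GL_ARRAY_BUFFER, matrixVbo);   // hypothetical VBO, one mat4 per object
for (int i = 0; i < 4; ++i) {
    glEnableVertexAttribArray(3 + i);       // slots 3..6 hold the four columns
    glVertexAttribPointer(3 + i, 4, GL_FLOAT, GL_FALSE,
                          16 * sizeof(float), (void*)(4 * i * sizeof(float)));
    glVertexAttribDivisor(3 + i, 1);        // advance once per instance
}
// Per frame: rewrite all matrices with a single call instead of
// one glUniformMatrix4fv per object.
glBufferSubData(GL_ARRAY_BUFFER, 0, objectCount * 16 * sizeof(float), matrices);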
Back to the very first sentence: it feels like engines do those manipulations for free, while we have to rethink how to provide and update the data. And while we do so, the engine might (but I actually don't know) say: 'update whatever you want, I'm going to send all the matrices anyway'.
Another example: what about a voxel-based world (e.g. Minecraft) where we only draw the visible surface, and we can throw a bomb and destroy many voxels? If the world's view data is in one huge buffer, we only have one glDraw* call but have to recreate the buffer every time. If there are smaller chunks, we have many glDraw* calls and also have to manipulate buffers, albeit smaller ones.
So is it a good deal to send, say, 10MB of buffer update data instead of 2 gl* calls with 1MB? How many updates are okay? Should a rendering loop deal with lazy updates?
I'm looking for a guide to what a 60fps application should be able to update/draw per frame, to get a feeling for what is possible. For my tests, every attempted optimization just becomes another bottleneck.
And I don't want those tutorials that say: hey, there is a cool new gl*Instanced call which is super fast, buuuuut you have to check whether your GPU supports it. Besides, I'd rather consider that an optimization than a meaningful first implementation.
Do you have any ideas, sources, best practices or rules of thumb for how a rendering/updating routine best plays together?
My questions are all nearly the same:
How many updates per frame are okay on today's hardware?
Can I lazy-load data so that it's available after a few frames, without freezing my application?
Do I have to keep updates small and profile my loop to see whether there are some microseconds left before the next render?
Maybe I should implement a real-time profiler that builds up a sense over time of how expensive updates are, so it can determine the number of updates per frame?
Thank you.
It's unclear how any of your questions relate to your "graphics engine" vs. "customized" examples. All the updates you do with a "graphics engine" are translated into those OpenGL calls in the end.
In brief:
How many updates per frame are okay on today's hardware?
Today's PCIe bandwidth is huge (it can go as high as 30 GB/s). However, to utilize it fully you have to reduce the number of transactions by consolidating OpenGL calls. The exact number of updates depends entirely on the hardware, the drivers, and the way you use them, and graphics hardware is diverse.
This is the kind of answer you didn't want to hear, but unfortunately you have to face the truth: to reduce the number of OpenGL calls you have to use the newer API versions. E.g. instead of setting each uniform individually, you are better off submitting a bunch of them through a uniform buffer object. Instead of submitting each model's MVP individually, it's better to use instanced rendering. And so on.
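For illustration, a minimal UBO sketch (GL 3.1+), assuming a shader block layout(std140) uniform PerFrame { mat4 viewProj; vec4 cameraPos; } and a linked 'program':

struct PerFrame { float viewProj[16]; float cameraPos[4]; }; // std140-compatible here

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(PerFrame), nullptr, GL_DYNAMIC_DRAW);

// Tie the shader's block and the buffer to the same binding point (0).
GLuint blockIndex = glGetUniformBlockIndex(program, "PerFrame");
glUniformBlockBinding(program, blockIndex, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

// Per frame: one upload replaces a pile of individual glUniform* calls.
PerFrame data = ComputePerFrameData();   // hypothetical helper
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(PerFrame), &data);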
An even more radical approach would be to move to a lower-level (and newer) API, i.e. Vulkan, which aims to solve exactly this problem: the cost of submitting work to the GPU.
Can I lazy-load data so that it's available after a few frames, without freezing my application?
Yes, you can upload buffer objects asynchronously. See Buffer Object Streaming for details.
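A minimal sketch of the "orphaning" idiom from that page: re-specifying the store with the same size lets the driver hand you fresh memory while the GPU may still be reading the old contents, so the upload doesn't stall.

glBindBuffer(GL_ARRAY_BUFFER, vbo);
// Orphan the old storage (same size, null data)...
glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_STREAM_DRAW);
// ...then fill the freshly allocated store without waiting for the GPU.
glBufferSubData(GL_ARRAY_BUFFER, 0, size, newData);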
Do I have to keep updates small and profile my loop to see whether there are some microseconds left before the next render?
Maybe I should implement a real-time profiler that builds up a sense over time of how expensive updates are, so it can determine the number of updates per frame?
You don't need any of these if you do it asynchronously.

Proper Implementation of Texture Atlas

I'm currently working alongside a piece of software that generates game maps by taking several images and tiling them together. Right now I'm working with OpenGL to draw these maps. As you know, switching states in OpenGL and making multiple draw calls is costly. I've decided to implement a texture atlas system, which would allow me to draw the entire map in a single draw call with no state switching. However, I'm having a problem implementing the texture atlas. Firstly, would it be better to store each TILE in the texture atlas, or the images themselves? Secondly, not all of the images are guaranteed to be square, or even powers of two. Do I pad them to the nearest power of two, a square, or both? Another thing that concerns me is that the images can get quite large, and I'm worried about exceeding the OpenGL size limit for textures, which would force me to split the map up, ruining the entire concept.
Here's what I have so far, conceptually:
-Generate texture
-Bind texture
-Generate image large enough to hold textures (Take padding into account?)
-Sort textures?
-Upload subtexture to blank texture, store offsets
-Unbind texture
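For illustration, a hedged sketch of those steps as GL calls (C++; 'tiles' with precomputed offsets stands in for the packing/sorting step, which is application-specific):

GLuint atlas;
glGenTextures(1, &atlas);
glBindTexture(GL_TEXTURE_2D, atlas);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR); // no mipmaps in this sketch
// Allocate a blank atlas big enough for every tile plus padding.
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, atlasW, atlasH, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
for (const Tile& t : tiles) {
    // Copy each image into its packed slot; remember (t.x, t.y) for the shader.
    glTexSubImage2D(GL_TEXTURE_2D, 0, t.x, t.y, t.w, t.h,
                    GL_RGBA, GL_UNSIGNED_BYTE, t.pixels);
}
glBindTexture(GL_TEXTURE_2D, 0);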
This is not so much a direct answer, but I can't really answer directly since you are asking many questions at once. I'll simply try to give you as much info as I can on the related subjects.
The following is a list of considerations for you, allowing you to rethink exactly what your priorities are and how you wish to execute them.
First of all, in my experience (!!), using texture arrays is much easier than using a texture atlas, and the performance is about equal. Texture arrays do exactly what you'd think: you sample them in shaders by name plus a layer index instead of just a name (the layer is effectively a third texture coordinate). The big drawback is that every texture in the array must have the same size; the advantages are easy indexing of subtextures and binding everything for one draw call.
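A minimal texture-array sketch for illustration (GL 3.0+; tileW, tileH, layerCount and images are assumptions):

GLuint texArray;
glGenTextures(1, &texArray);
glBindTexture(GL_TEXTURE_2D_ARRAY, texArray);
glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
// One allocation for all layers; every layer must be tileW x tileH.
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, tileW, tileH, layerCount,
             0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
for (int layer = 0; layer < layerCount; ++layer)
    glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0, 0, 0, layer,
                    tileW, tileH, 1, GL_RGBA, GL_UNSIGNED_BYTE, images[layer]);
// GLSL side: uniform sampler2DArray tex;  vec4 c = texture(tex, vec3(uv, layer));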
Second of all, always use powers of 2. I don't know if some recent systems handle non-power-of-2 textures entirely without problems, but (again, in my experience) it is best to use powers of 2 everywhere. One of the problems I had with a 500*500 texture was black lines when drawing textured quads; these black lines were exactly the size needed to pad to the nearest power of two (12 pixels on x and y). So OpenGL somewhat creates this problem for you even on recent hardware.
Third of all (is this even English?), concerning size: all your problems seem to involve images and textures. You might want to look at texture buffers; they allow large amounts of data to be streamed to your graphics card and are easier to update than textures (which allows for LOD map systems). This is mostly useful if you use textures but only need the data represented in their colors, not the colors directly.
Finally, you might want to look at "texture splatting"; this is a way to increase detail without increasing data. I don't know exactly what you are making, so I don't know if you can use it, but it's easy and it's used a lot in the game industry: you create a set of textures (rock, sand, grass, etc.) that you use everywhere, plus one big texture keeping track of which smaller texture is applied where.
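For illustration, a hedged sketch of a splat fragment shader (GLSL source embedded as a C++ string; all texture names are hypothetical):

const char* splatFrag = R"(
#version 330 core
uniform sampler2D splatMap;  // RGB weights: one channel per detail texture
uniform sampler2D rockTex;
uniform sampler2D sandTex;
uniform sampler2D grassTex;
in vec2 uv;        // coarse, map-wide coordinates
in vec2 detailUv;  // tiled detail coordinates
out vec4 color;
void main() {
    vec3 w = texture(splatMap, uv).rgb;
    color = w.r * texture(rockTex,  detailUv)
          + w.g * texture(sandTex,  detailUv)
          + w.b * texture(grassTex, detailUv);
}
)";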
I hope at least one of the things I wrote here will help you out,
Good luck!
PS: OpenGL texture size limits depend on the user's graphics card, so be careful with sizes greater than 2048*2048; even if your computer runs fine, others might have serious issues. Safe values are anything up to 1024*1024.
PPS: excuse any grammar mistakes; ask for clarification if needed. Also, this is my first answer ever, excuse my lack of protocol.

2D pixel-perfect collision detection with opengl

I'm writing a 2D, sprite-based game and I'm having a hard time with collision detection. First of all, I'm well aware of other methods; in fact I'm using Box2D's quadtree queries to filter out non-overlapping sprites. So pixel-perfect detection would be used only on sprites that were found to overlap, and only a few times per frame. The sprites rotate but don't scale.
The problem is that I need it done with pixels, because the sprites can change over time, and building and using e.g. Box2D geometric shapes to approximate the bitmaps would get really complicated.
I did some research and found that these methods are possible in OpenGL for checking whether any pixels with a non-zero alpha channel overlap:
1) Rendering the sprites to a texture/buffer with e.g. 50% alpha and a proper blending function, copying the result to RAM, and checking if there's any pixel with alpha greater than e.g. 80%.
This method is simple, but as I checked, copying back is extremely slow.
2) Using OpenGL's occlusion query.
From what I found on the net, occlusion queries can be tricky (plus you sometimes need to wait until the next frame to get the result) and buggy on some graphics cards. The fact that such queries don't produce results immediately is a deal breaker because of how the game is designed to work.
3) Shaders and atomic counters.
I'm not sure if it would work, but it seems that using a fragment shader when rendering the second sprite that increments an atomic counter each time it overwrites something, and then checking the counter's value on the CPU side, could be a solution. The only problem is that atomic counters are pretty new and two- to three-year-old machines may not support them.
Is there something I missed? Or should I just forget about using the GPU and write my own renderer just for collision detection on the CPU?
Atomic counters are an appropriate way to do this on the GPU. Since you're going to be checking many, many pixels, you might as well do it in parallel. The big performance question here is reading the result back asynchronously, but that depends on how you build your engine, of course.
Atomic counters require OpenGL 4.2; it's quite possible your graphics card supports this, but you should check.
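A hedged sketch of the atomic-counter route (GL 4.2; this is an illustration, not the answerer's code, and the stencil setup for sprite A is only described in comments). Idea: draw sprite A into the stencil buffer, then draw sprite B with the stencil test passing only where A covered; every surviving fragment bumps the counter, so count > 0 means overlap.

GLuint acb;
glGenBuffers(1, &acb);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, acb);
GLuint zero = 0;
glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), &zero, GL_DYNAMIC_READ);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, acb);

// Fragment shader side (GLSL 4.20):
//   layout(early_fragment_tests) in; // so stencil-failed fragments don't count
//   layout(binding = 0) uniform atomic_uint overlapCount;
//   ... if (texColor.a > 0.0) atomicCounterIncrement(overlapCount);

// After drawing sprite B, read back the count; this sync point is the
// expensive part the answer warns about.
GLuint count = 0;
glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &count);
bool overlap = (count > 0);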

Questions about texture atlas and render sorting - OpenGL

I am emulating an old client, and most of the textures are 128x128. Since, as I understand it, switching textures and draw calls are expensive, could I get away with creating, at runtime, a very large texture atlas of all the tiny textures?
Then, could I bind the large texture and render via shaders with each texture's offset into the atlas? What kind of performance hit would this have?
My second question involves sorting. The level files are broken up into little chunks of a BSP tree. They are very small and there are often thousands per level. The current way, which is definitely on the slow side, is that I render each group of textures in each BSP leaf that is in my frustum and PVS (Quake 3 style). What would be the ideal fix for this?
Would I want to run through each region I can see (back to front), group all of the visible triangles by texture, and then render them all at once? For some reason I always feel this may be slower. Does it make more sense to sort first and render all at once, or to skip sorting and render each region at a time?
Yes, it should be possible to pack the textures into a texture atlas. Careful consideration would need to be given to issues like whether the textures are tiled, and what you want to happen with mip-maps and filtering.
I'd probably suggest holding off on doing that until you've got the geometry sorted out - reducing the number of draw calls to a sane number may well be all you need to do. The cost of changing a texture isn't necessarily that bad on modern hardware.
As for rendering the level, on modern hardware it's more usual to render front to back, rather than back to front, in order to take advantage of the z-buffer and early culling (i.e. if you have a wall right in front of the camera, and you draw that first with z-buffering enabled, the hardware is fairly good at rejecting stuff you then try to draw behind it).
One possible approach would be to reprocess the BSP into a rougher structure, such as a simple grid (or maybe a quadtree or octree). You don't even need to split polygons neatly, just have a bunch of sectors with a bounding box around each, then frustum-cull and loosely sort (front to back) the boxes. You can keep a PVS with this approach too, but its usefulness might drop as your rendering chunks become larger and coarser.
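An illustrative sketch of that cull-and-sort step ('Sector', 'AABB', 'frustum' and the camera position are hypothetical engine types/values):

#include <algorithm>
#include <vector>

struct AABB { float centerX, centerY, centerZ; /* extents... */ };
struct Sector { AABB bounds; /* draw ranges, textures, PVS id, ... */ };

std::vector<Sector*> visible;
for (Sector& s : sectors)
    if (frustum.Intersects(s.bounds))          // coarse AABB-vs-frustum test
        visible.push_back(&s);

// Loose front-to-back sort by squared distance to each box center.
auto dist2 = [&](const Sector* s) {
    float dx = s->bounds.centerX - camX;
    float dy = s->bounds.centerY - camY;
    float dz = s->bounds.centerZ - camZ;
    return dx * dx + dy * dy + dz * dz;
};
std::sort(visible.begin(), visible.end(),
          [&](const Sector* a, const Sector* b) { return dist2(a) < dist2(b); });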
However, before doing any of this, I'd definitely recommend setting up some benchmark tests and recording performance information. You won't know for sure whether you're doing the right thing unless you actually analyse the performance. If you can pinpoint exactly which thing you do performs worst, you just need to fix that.

OpenGL Picking from a large set

I'm trying, in JOGL, to pick from a large set of rendered quads (several thousand). Does anyone have any recommendations?
To give you more detail, I'm plotting a large set of data as billboards with procedurally created textures.
I've seen this post OpenGL GL_SELECT or manual collision detection? and have found it helpful. However, it can take my program up to several minutes to complete a rendering of the full set, so I don't think drawing twice (for color picking) is an option.
I'm currently drawing with calls to glBegin/glVertex.../glEnd. If I made the switch to batched rendering on the GPU with VAOs and VBOs, do you think I would get a speedup large enough to make color picking feasible?
If not, given all of the recommendations against using GL_SELECT, do you think it would be worth me using it?
I've investigated multithreaded CPU approaches to picking these quads that sidestep OpenGL altogether. Do you think an OpenGL-less CPU solution is the way to go?
Sorry for all the questions. My main question remains: what is a good way to pick from a large set of quads using OpenGL (JOGL)?
The best way to pick from a large number of quads cannot be easily defined. I don't like color picking or similar techniques very much because they seem impractical for most situations. I never understood why so many tutorials aimed at people new to OpenGL (or even to programming) focus on picking, which is just useless for nearly everything. For example: try to get the pixel you clicked on in a heightmap: not possible. Try to locate the exact mesh in a model you clicked on: impractical.
If you have a large number of quads you will probably need good spatial partitioning, or at least (better: also) a scene graph. OK, you don't strictly need this, but it helps A LOT. Look at some scene graph tutorials for further information; it's a good thing to know when you start with 3D programming, because you pick up a lot of concepts and not just OpenGL code.
So what do you do to start with picking? Unproject the mouse cursor position through the inverse of your modelview-projection matrix (iirc gluUnProject(...) does this for you). With the orientation of your camera you can then cast a ray into your spatial structure (or the scene graph that holds it) and check for collisions with your quads. I currently have no link, but if you search for 'inverse modelview matrix' you should find pages that explain this better and in more detail than is practical here.
With this raycasting-based technique you will be able to find your quad in O(log n), where n is the number of quads. With some heuristics based on the exact layout of your application (your question is too generic to be more specific) you can improve this a lot for most cases.
An easy spatial structure for this is, for example, a quadtree. However, you should start with the raycasting first to fully understand the technique.
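For illustration, a hedged C++ sketch of the unproject-and-raycast idea using GLM (glm::unProject mirrors gluUnProject; the asker's JOGL has an equivalent via GLU):

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// Build a world-space picking ray from a mouse position. 'view', 'proj'
// and 'viewport' (x, y, width, height) must match the render pass.
void PickRay(float mouseX, float mouseY,
             const glm::mat4& view, const glm::mat4& proj,
             const glm::vec4& viewport,
             glm::vec3& origin, glm::vec3& dir)
{
    float winY = viewport.w - mouseY; // window Y is usually flipped vs. GL
    glm::vec3 nearPt = glm::unProject(glm::vec3(mouseX, winY, 0.0f), view, proj, viewport);
    glm::vec3 farPt  = glm::unProject(glm::vec3(mouseX, winY, 1.0f), view, proj, viewport);
    origin = nearPt;
    dir = glm::normalize(farPt - nearPt);
    // Walk the quadtree/scene graph with this ray and intersect it with
    // the quads' bounding volumes.
}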
I've never faced this problem myself, but in my opinion CPU-based picking is the best way to try.
If you have a large set of quads, maybe you can group them spatially to avoid testing every quad. For example, you could group the quads into two boxes and first test which box you hit.
I just implemented color picking, but glReadPixels is slow here (I've read somewhere that it can hurt the asynchronous behaviour between GL and CPU).
Another possibility, it seems to me, is using transform feedback with a geometry shader that does the scissor test. The GS can discard all faces that do not contain the mouse position; the transform feedback buffer then contains exactly the information about the hovered meshes.
You probably want to write the depth to the transform feedback buffer too, so that you can find the topmost hovered mesh.
This approach also works nicely with instancing (additionally write the instance ID to the buffer).
I haven't tried it yet, but I guess it will be a lot faster than using glReadPixels.
I only found this reference for this approach.
I'm using a solution that I borrowed from the DirectX SDK; there's a nice example there of how to detect the selected polygon in a vertex buffer object.
The same algorithm works nicely with OpenGL.