State-of-the-art culling and batching techniques in rendering (OpenGL)

I'm currently upgrading and restructuring an OpenGL render engine. The engine is used for visualising large scenes of architectural data (buildings with interiors), and the number of objects can become rather large. As in any building, many objects are hidden inside walls, and you naturally only see the objects in the same room as you, or the exterior if you are outside. This leaves a large number of objects that should be removed through occlusion culling and frustum culling.
At the same time there is a lot of repetitive geometry that can be combined into render batches, and also a lot of objects that can be drawn with instanced rendering.
The way I see it, it can be difficult to combine batching and culling in an optimal fashion. If you batch too many objects into the same VBO, it's difficult to cull individual objects on the CPU in order to skip rendering that batch. If you skip culling on the CPU, the GPU ends up processing a lot of objects that are not visible. And if you skip batching completely so that you can cull easily on the CPU, you end up with an unwantedly high number of draw calls.
I have done some research into existing techniques and theories on how these problems are solved in modern graphics, but I have not been able to find any concrete solution. An idea a colleague and I came up with was restricting batches to objects relatively close to each other, e.g. all chairs in a room, or everything within a radius of n meters. This could be simplified and optimized through the use of octrees.
Does anyone have any pointers to techniques used for scene management, culling, batching, etc. in state-of-the-art modern graphics engines?

There's lots of information about frustum and occlusion culling on the internet.
Most of it comes from game developers.
Here's a list of some articles that will get you started:
http://de.slideshare.net/guerrillagames/practical-occlusion-culling-in-killzone-3
http://de.slideshare.net/TiagoAlexSousa/secrets-of-cryengine-3-graphics-technology
http://de.slideshare.net/Umbra3/siggraph-2011-occlusion-culling-in-alan-wake
http://de.slideshare.net/Umbra3/visibility-optimization-for-games
http://de.slideshare.net/Umbra3/chen-silvennoinen-tatarchuk-polygon-soup-worlds-siggraph-2011-advances-in-realtime-rendering-course
http://de.slideshare.net/DICEStudio/culling-the-battlefield-data-oriented-design-in-practice
http://www.cse.chalmers.se/~uffe/vfc.pdf
My (pretty fast) renderer works like this:
Collection: Send all props that you want to render to the renderer.
Frustum culling: The renderer culls the invisible props from the list using multiple threads in parallel.
Occlusion culling: Now you could do occlusion culling on the CPU (I haven't implemented it yet, because I don't need it for now). Detailed information on how to do it efficiently can be found in the Killzone and Crysis slides. One solution is to read back the previous frame's depth buffer from the GPU and then rasterize the objects' bounding boxes on top of it to check whether each object is visible.
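A minimal sketch of that depth read-back test (all names here are hypothetical, and the downsampled buffer must keep the farthest depth per cell to stay conservative):

// Conservative CPU occlusion test against a low-res copy of last frame's
// depth buffer, assuming smaller values are nearer.
#include <vector>
#include <algorithm>

struct DepthGrid {
    int width, height;
    std::vector<float> depth;   // read back (and downsampled) from the GPU

    float At(int x, int y) const { return depth[y * width + x]; }
};

struct ScreenRect {             // object's bounding box projected to screen
    int minX, minY, maxX, maxY;
    float minDepth;             // nearest depth of the box, in [0,1]
};

// Returns true only if every depth sample under the rectangle is nearer than
// the object's nearest point, i.e. the object is certainly hidden.
bool IsOccluded(const DepthGrid& grid, const ScreenRect& r)
{
    int x0 = std::max(r.minX, 0), x1 = std::min(r.maxX, grid.width  - 1);
    int y0 = std::max(r.minY, 0), y1 = std::min(r.maxY, grid.height - 1);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (grid.At(x, y) >= r.minDepth)
                return false;   // some sample is farther: object may be visible
    return true;
}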
by texture, because texture switches are very costly.
Rendering: Iterate through your partitioned meshes and render them.
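As a rough illustration of the sorting step, here is a minimal sketch with hypothetical DrawItem fields; real engines usually pack these criteria into a single sort key:

#include <vector>
#include <algorithm>
#include <cstdint>

struct DrawItem {
    uint32_t textureId;
    uint32_t meshId;
    float    viewDepth;   // distance from the camera, for the optional depth sort
};

// Order by texture first (switches are costly), then by mesh (so identical
// meshes form runs that are easy to instance), then front to back.
void SortForBatching(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b) {
            if (a.textureId != b.textureId) return a.textureId < b.textureId;
            if (a.meshId    != b.meshId)    return a.meshId    < b.meshId;
            return a.viewDepth < b.viewDepth;
        });
}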
And as "Full Frontal Nudity" already said: There's no perfect solution.

Related

Hierarchical Z-buffering for occlusion culling

I'm reading the Occlusion Culling section in Real-Time Rendering, 3rd Edition, and I couldn't understand how it works. Some questions:
How does having a "Z-pyramid" help? Why do we need multiple resolutions of the Z-buffer? (The book illustrates this with a figure of the Z-pyramid, not reproduced here.)
Is the octree structure the same octree that is used for general frustum culling and rendering, or is it a specialized octree made just for the occlusion-culling technique?
A more general question: in a previous section (and also here), the term occlusion query is described as "rendering a simplified bounding volume of an object and comparing its depth results to the Z-buffer, returning the number of pixels that are visible." What functions in OpenGL are associated with this occlusion-query concept?
Is this technique the standard for occlusion culling in open-world games?
The hierarchical Z-buffer is useful in situations where large nearby objects are likely to occlude many small, farther objects. Examples would be rendering the inside of a building or a mountainous landscape.
When a nearby object is rendered, it sets pixels on the lower-resolution levels of the Z-buffer pyramid to a near depth value. When a smaller, farther object is about to be rendered, its bounding box can first be checked against such a pixel and the whole object culled at once.
Yes, it's the same octree. But it doesn't have to be an octree: any hierarchical spatial indexing data structure works for both hierarchical Z-buffering and frustum culling.
To benefit from the hierarchical Z-buffer, we need to render the scene starting with the nearest objects. Structures like octrees or BSP trees can provide that ordering.
Additionally, having an octree at hand lets us cull entire tree branches with a single test against their bounding box, rather than testing separate objects or triangles.
The OpenGL functions responsible for occlusion queries are glGenQueries, glBeginQuery, glEndQuery and glGetQueryObjectuiv. See Query Objects for details.
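A minimal sketch of such a query; DrawBoundingBox() is a hypothetical helper that renders the object's bounds:

GLuint query;
GLuint samples = 0;
glGenQueries(1, &query);

glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);  // don't touch the color buffer
glDepthMask(GL_FALSE);                                // don't write depth either

glBeginQuery(GL_SAMPLES_PASSED, query);
DrawBoundingBox();                                    // hypothetical helper
glEndQuery(GL_SAMPLES_PASSED);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

// Note: fetching the result immediately stalls the pipeline; real engines
// wait a frame or use conditional rendering instead.
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);
if (samples > 0) {
    // at least one pixel of the bounding box passed the depth test: draw the object
}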
Hierarchical Z-buffers were implemented in hardware on some early Radeons; however, I haven't heard of the technique being used nowadays.
Occlusion queries, on the other hand, are commonly used, and in essence they give similar benefits.

DirectX Transparent textures - Making plants look good

I've recently started learning DirectX programming in C++. I have some experience with graphics programming in other languages, but I am new to the DirectX scene.
Anyway, I wanted to ask about transparent textures. So far I've always used alpha testing, as that has met my needs, but I've recently begun to wonder how "proper" game engines manage to render such good-looking semi-transparent textures for things like plants and trees, with smooth transparency.
Every time I've used alpha testing, the textures have ended up looking blocky and just plain bad. I'd love to have smooth, semi-transparent textures that draw as I would expect.
My guess is that render calls are executed in order, starting with things far from the camera and moving closer. However, I can't really see how this works for pre-made models. For example, if you have a tree model where the leaves and trunk share a model, how do you guarantee that the back leaves draw first, the trunk draws correctly over them, and the front leaves look correct over the trunk?
I tried the method above and also disabled Z-buffering for transparent objects such as smoke particles. It sort of worked, but looked messy, and the effect changed depending on the viewing angle, so it didn't seem ideal.
So, in short: what methods do "proper" games use to correctly draw smooth alpha textures (with a full range of alpha values) into a 3D scene, for things like foliage?
Thanks,
Michael.
Ordered transparency is most basically accomplished using the painter's algorithm.
The painter's algorithm falls apart where an object needs to be drawn both in front of and behind another object, or where a single object has multiple transparent sub-components. We can't easily sort the sub-components of a mesh relative to each other.
While it doesn't solve those problems, the Z-buffer lets us optimize rendering. Most games use this slightly more complex algorithm as the basis of their rendering:
Render all opaque objects, sorted by material state or front to back to avoid overdraw.
Render all transparent objects, sorted back to front.
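A minimal sketch of that two-pass loop, with hypothetical Renderable and Draw() placeholders:

#include <vector>
#include <algorithm>

struct Renderable { float viewDepth; bool transparent; /* mesh, material... */ };
void Draw(const Renderable&);   // hypothetical draw call

void RenderScene(std::vector<Renderable>& objects)
{
    // Opaque objects first, then transparent ones.
    auto split = std::partition(objects.begin(), objects.end(),
        [](const Renderable& r) { return !r.transparent; });

    std::sort(objects.begin(), split,   // opaque: front to back (early-z wins)
        [](const Renderable& a, const Renderable& b) { return a.viewDepth < b.viewDepth; });
    std::sort(split, objects.end(),     // transparent: back to front (painter's)
        [](const Renderable& a, const Renderable& b) { return a.viewDepth > b.viewDepth; });

    // glDepthMask(GL_TRUE) for the opaque pass, GL_FALSE for the transparent pass.
    for (auto& r : objects) Draw(r);
}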
Games use a variety of techniques in combination to avoid these problems:
Split models into non-overlapping transparent sections. Often this happens implicitly, because a game's transparent parts tend to use different materials than the rest of the model. You can also split models with multiple layers of transparency so that each new model's layers do not overlap; for example, you could split a pine tree model radially into five sections.
This was more common with fixed-function pipelines. Modern games mostly just try to avoid the problem.
Avoid semi-transparent parts in models. Use transparency only for anti-aliasing edges, and where the transparent object can split the world cleanly into two separate groups of objects (windows or water planes, for example). Splitting the world like this and rendering those chunks front to back allows the anti-aliased edges to be drawn without causing obvious cut-outs on other transparent objects. The edges themselves tend to look good even if they overlap, as long as your alpha test is set higher than roughly 30%.
Semi-transparent objects are often rendered as particle effects; grass and smoke are the most common examples. The point list for the effect or group of grass objects is sorted each frame, which is a much simpler problem than sorting arbitrary sub-meshes. Many outdoor games have complex grass and foliage instancing systems that let them render individual leaves and blades properly sorted without most of the rendering overhead of doing it this way, but they strictly limit the types of objects they can handle.
Many effects can be done in an order-independent way using additive or subtractive blending rather than alpha blending.
There are a couple of easy options if your smooth edges are still unacceptable. You can dither any parts of the model below 75% transparency. Or you can have the hardware do it for you without visible artifacts by using alpha-to-coverage, which causes the multisampling hardware to dither the edges across the extra samples. It won't give you a smooth gradient, but 4-16 levels of alpha are perfectly acceptable for anti-aliasing edges, and free if you already intend to use MSAA.
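In OpenGL terms (the question is about DirectX, but the idea is the same), enabling alpha-to-coverage is a one-liner on an MSAA framebuffer:

glEnable(GL_MULTISAMPLE);
glEnable(GL_SAMPLE_ALPHA_TO_COVERAGE);  // fragment alpha becomes a coverage mask
// ... draw foliage using its alpha channel; no sorting needed ...
glDisable(GL_SAMPLE_ALPHA_TO_COVERAGE);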
There are a lot of caveats and special cases. If you have water you will probably need to render any semi-transparent objects that intersect the water twice using a stencil or depth test.
Moving the camera in and out of transparent objects is always problematic.
It is nearly impossible to render a complex semi-transparent object, like an X-ray view of a building or a ghost. Many games simply render this type of object additively, but with modern hardware a variety of more complex schemes are possible.
More complex schemes
Depth peeling is a technique where you render multiple passes, each one "peeling away" the closest remaining layer of transparent geometry, so the scene can be composited back to front regardless of draw order or which object carries the alpha. It is less expensive than you might expect, because many objects contribute to only one or two layers, but it is not perfect, and many game developers find it too costly.
There are many other varieties of order-independent transparency. With a modern GPU and compute, we can render the scene in a single pass into a buffer where each pixel holds a stack of fragments, then sort and blend those stacks in a post-process, paying the cost only where a pixel actually has layers of transparency.
OIT is still mostly used in special cases like 2.5D games (such as LittleBigPlanet), but I believe it may eventually become a core tool in game programming.

Frustum culling with VBOs

I am planning to write a 3D game that will use VBOs for rendering. Say, for example, that the terrain is a set of tiles whose vertices are all in the same VBO. The player should be able to scroll through the tiles, and at any time would only see a part of them.
I would like to perform frustum culling on those tiles. I have already found some sources on the maths of frustum culling, but I am not sure how to implement it with a VBO: do people somehow do it in the vertex shader, or do they just call the rendering function to draw a subset of the VBO?
Given that your camera acts like in Diablo (whether isometric or perspective):
If you have a fixed map size, you can use one VBO for the base geometry of your map, assuming you use a heightmap-based solution. Quads that are not visible will be discarded by your graphics card after the vertex shader, so they do not affect your pixel fill rate; they are not worth the overhead of culling on your side. Details like rocks, houses, etc. will have their own VBOs anyway.
If you aim for a streaming-content engine with a huge seamless world, create chunks (the size of a chunk depends on your game), divide your terrain into those chunks, and test the camera frustum against their bounding boxes before drawing.
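A minimal sketch of that bounding-box test, assuming the six frustum planes have been extracted with their normals pointing into the frustum:

struct Plane { float nx, ny, nz, d; };           // n·p + d >= 0 means "inside"
struct AABB  { float minX, minY, minZ, maxX, maxY, maxZ; };

bool ChunkVisible(const Plane frustum[6], const AABB& box)
{
    for (int i = 0; i < 6; ++i) {
        const Plane& p = frustum[i];
        // Pick the box corner farthest along the plane normal (the "p-vertex").
        float x = p.nx >= 0 ? box.maxX : box.minX;
        float y = p.ny >= 0 ? box.maxY : box.minY;
        float z = p.nz >= 0 ? box.maxZ : box.minZ;
        if (p.nx * x + p.ny * y + p.nz * z + p.d < 0)
            return false;                         // box is entirely outside this plane
    }
    return true;                                  // inside or intersecting the frustum
}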
About drawing chunks:
The simplest way, which is enough for most games, is to give each chunk its own geometry, VBO, and so on. You can optimize later, and your terrain implementation should not drive your engine's API design (you will have to implement many different ways to draw things in your engine anyway, for instance particles, post-processing effects, etc.).
One way to optimize is to use a single VBO for the geometry together with instanced drawing; just as in particle systems, you then take some other source for part of your data, such as the global transformation, the height of each vertex, and so on.
But keep in mind that most games don't really need that much optimization in the terrain alone. Other systems will cross your path that are more worthy of optimization.

3d Occlusion Culling

I'm writing a Minecraft-like static 3D block world in C++ and OpenGL. I'm working on improving frame rates, and so far I've implemented frustum culling using an octree. This helps, but I'm still seeing moderate to bad frame rates. The next step would be to cull cubes that are hidden from the viewpoint by closer cubes, but I haven't been able to find many resources on how to accomplish this.
Create a render target with a Z-buffer (or "depth buffer") enabled. Then make sure to sort all your opaque objects so they are rendered front to back, i.e. the ones closest to the camera first. Anything using alpha blending still needs to be rendered back to front, AFTER you have rendered all your opaque objects.
Another technique is occlusion culling: You can cheaply "dry-render" your geometry and then find out how many pixels failed the depth test. There is occlusion query support in DirectX and OpenGL, although not every GPU can do it.
The downside is that you need a delay between issuing the query and fetching the result; depending on the setup (for example when using predicated tiling), it may be a full frame. That means you need to be creative, for example by rendering a bounding box that is bigger than the object itself, and dismissing the results after a camera cut.
And one more thing: a more traditional solution (which you can use concurrently with occlusion culling) is a room/portal system, where you define regions as "rooms" connected via "portals". If a portal is not visible from your current room, you can't see the room connected to it. And even if it is, you can clip your viewport to what's visible through the portal.
The approach I took in this Minecraft level renderer is essentially a frustum-limited flood fill. The 16x16x128 chunks are split into 16x16x16 chunklets, each with a VBO containing the relevant geometry. I start a flood fill in the chunklet grid at the player's location to find chunklets to render. The fill is limited by:
The view frustum
Solid chunklets - if the entire side of a chunklet is opaque blocks, then the floodfill will not enter the chunklet in that direction
Direction - the flood will not reverse direction, e.g. if the current chunklet is to the north of the starting chunklet, do not flood into the chunklet to the south
It seems to work OK. I'm on Android, so while a more complex analysis (antiportals, as noted by Mike Daniels) would cull more geometry, I'm already CPU-limited, so there's not much point.
I've just seen your answer to Alan: culling is not your problem; it's what and how you're sending to OpenGL that is slow.
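For illustration, a minimal sketch of such a frustum-limited flood fill; InFrustum, FaceIsSolid and ReversesDirection are hypothetical stand-ins for the real engine queries:

#include <queue>
#include <set>
#include <tuple>
#include <vector>

struct Coord {
    int x, y, z;
    bool operator<(const Coord& o) const {
        return std::tie(x, y, z) < std::tie(o.x, o.y, o.z);
    }
};

bool InFrustum(const Coord& c);                        // chunklet inside the view frustum?
bool FaceIsSolid(const Coord& from, const Coord& to);  // is the shared side fully opaque?
bool ReversesDirection(const Coord& start, const Coord& from, const Coord& to);

void CollectVisibleChunklets(const Coord& start, std::vector<Coord>& out)
{
    static const int dirs[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};
    std::queue<Coord> open;
    std::set<Coord> seen;
    open.push(start);
    seen.insert(start);
    while (!open.empty()) {
        Coord c = open.front();
        open.pop();
        out.push_back(c);                                 // this chunklet gets rendered
        for (const auto& d : dirs) {
            Coord n{c.x + d[0], c.y + d[1], c.z + d[2]};
            if (seen.count(n)) continue;
            if (!InFrustum(n)) continue;                  // limit 1: the view frustum
            if (FaceIsSolid(c, n)) continue;              // limit 2: solid chunklet sides
            if (ReversesDirection(start, c, n)) continue; // limit 3: no doubling back
            seen.insert(n);
            open.push(n);
        }
    }
}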
What to draw: don't render a cube for each block; render only the faces of opaque blocks that border a transparent block, as sketched below. Consider a 3x3x3 cube of, say, stone blocks: there is no point drawing the center block, because there is no way the player can see it. Likewise, the player will never see the faces between two adjacent stone blocks, so don't draw them.
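A minimal sketch of that rule; IsOpaque and EmitFace are hypothetical helpers:

// IsOpaque() returns false for air and out-of-range coordinates;
// EmitFace() appends one quad's vertices to the chunk's VBO data.
bool IsOpaque(int x, int y, int z);
void EmitFace(int x, int y, int z, int face);

static const int kNeighbors[6][3] = {
    {1,0,0}, {-1,0,0}, {0,1,0}, {0,-1,0}, {0,0,1}, {0,0,-1}
};

void BuildChunkMesh(int sizeX, int sizeY, int sizeZ)
{
    for (int x = 0; x < sizeX; ++x)
    for (int y = 0; y < sizeY; ++y)
    for (int z = 0; z < sizeZ; ++z) {
        if (!IsOpaque(x, y, z)) continue;      // air/transparent blocks own no faces here
        for (int f = 0; f < 6; ++f) {
            int nx = x + kNeighbors[f][0];
            int ny = y + kNeighbors[f][1];
            int nz = z + kNeighbors[f][2];
            if (!IsOpaque(nx, ny, nz))         // face meets air: it can be seen
                EmitFace(x, y, z, f);
        }
    }
}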
How to draw: As noted by Alan, use VBOs to batch geometry. You will not believe how much faster they make things.
An easier approach, with minimal changes to your existing code, would be to use display lists. This is what Minecraft uses.
How many blocks are you rendering, and on what hardware? Modern hardware is very fast and very difficult to overwhelm with geometry (unless we're talking about a handheld platform). On any moderately recent desktop hardware you should be able to render hundreds of thousands of cubes per frame at 60 frames per second without any fancy culling tricks.
If you're drawing each block with a separate draw call (glDrawElements/Arrays, glBegin/glEnd, etc.) (bonus points: don't use glBegin/glEnd), then that will be your bottleneck. This is a common pitfall for beginners. If you're doing this, you need to batch together all triangles that share texture and shading parameters into a single call per state setup. If the geometry is static and doesn't change from frame to frame, use one vertex buffer object for each batch of triangles.
This can still be combined with octree-based frustum culling if you typically have only a small portion of your total game world in the view frustum at one time. The vertex buffers stay statically loaded and unchanged; frustum-cull the octree to generate only the index buffers for the triangles in the frustum, and upload those dynamically each frame.
If you have surfaces close to the camera, you can construct a frustum that represents the volume hidden behind such a surface, and cull objects that are entirely contained in that frustum. In the diagram below, C is the camera, | is a flat surface near the camera, and the frustum-shaped region of . characters represents the occluded volume. The surface is called an antiportal.
        .
       ..
      ...
     ....
    |....
    |....
    |....
    |....
C   |....
    |....
    |....
    |....
     ....
      ...
       ..
        .
(You should of course also turn on depth testing and depth writing as mentioned in other answer and comments -- it's very simple to do in OpenGL.)
The use of a Z-Buffer ensures that polygons overlap correctly.
Enabling the depth test makes every drawing operation check the Z-buffer before placing pixels onto the screen.
If you have convex objects, you must (for performance) also enable backface culling!
Example code:
glEnable(GL_CULL_FACE);   // skip rasterizing faces that point away from the camera
glCullFace(GL_BACK);      // cull back faces; pass GL_FRONT to cull front faces instead
glEnable(GL_DEPTH_TEST);  // check the Z-buffer before placing pixels on screen
glDepthMask(GL_TRUE);     // and write new depth values into it
// Draw the "game world"...

OpenGL voxel engine slow

I'm making a voxel engine in C++ and OpenGL (à la Minecraft) and can't get a decent frame rate on my 3 GHz machine with an ATI X1600. I'm all out of ideas.
When I have about 12,000 cubes on the screen, it falls to under 20 fps - pathetic.
So far the optimizations I have are: frustum culling, backface culling (via OpenGL's glEnable(GL_CULL_FACE)), and drawing only the visible faces (except the culled ones, of course); everything is stored in an octree.
I've tried VBOs; I don't like them, and they do not significantly increase the fps.
How can Minecraft's engine be so fast? I struggle with 10,000 cubes, whereas Minecraft can easily draw far more at higher fps.
Any ideas?
@genpfault: I analyze the connectivity and only generate faces for the outer, visible surface. The VBO held a single cube that I glTranslate()d for each block.
I'm not an expert in OpenGL, but as far as I understand this saves very little time, because you still have to send every cube to the card.
Instead, you should generate the faces for the entire outer visible surface, put them in a VBO, send that to the card once, and keep rendering that VBO until the geometry changes. This saves a lot of the time your card spends waiting for your processor to send it geometry.
You should profile your code to find out whether the bottleneck in your application is on the CPU or the GPU. For instance, it might be that your culling/octree algorithms are slow, in which case it is not an OpenGL problem at all.
I would also keep count of the number of cubes you draw each frame and display it on screen, just so you know your culling routines work as expected.
Finally, you don't mention whether your cubes are textured. Try using smaller textures, or disable texturing, and see how much the frame rate increases.
gDEBugger is a great tool that will help you find bottlenecks with OpenGL.
I don't know if it's OK to "bump" an old question here, but a few things came to my mind:
If your voxels are static, you can speed up the whole rendering process by using an octree for frustum culling, etc. Furthermore, you can compile a static scene into a potentially visible set (PVS) stored in the octree. The main principle of a PVS is to precompute, for every node in the tree, which other nodes are potentially visible from it, and store pointers to them in a vector. When rendering, you first check which node the camera is in and then run frustum culling against all nodes in that node's PVS vector. (Carmack used something like this in the Quake engines, but with binary space partitioning trees.)
If the shading of your voxels is somewhat complex, it is also fast to do a depth-only pre-pass: without writing into the color buffer, just fill the depth buffer. After that you render a second pass: disable writing to the depth buffer and render only to the color buffer while testing against the depth buffer. This way you avoid expensive shader computations that would later be overwritten by a new fragment closer to the viewer. (Carmack used this in Quake 3.)
Another thing that will definitely speed things up is instancing. You store only the position of each voxel (and, if necessary, its scale and other parameters) in a texture buffer object. In the vertex shader you then read the position of the voxel to be spawned and create an instance of it (i.e. a cube whose geometry is given to the shader in a vertex buffer object). So you send the 8 vertices + 8 normals (3 * sizeof(float) * 8 + 3 * sizeof(float) * 8 + floats for color/texture etc.) only once to the card in the VBO, and then only the positions of the cube instances (3 * sizeof(float) * number of voxels) in the TBO.
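In OpenGL terms, the two passes could look roughly like this (the draw helpers are hypothetical):

// Pass 1: fill the depth buffer only, write no color.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
DrawSceneWithCheapShader();    // hypothetical helper

// Pass 2: full shading; depth buffer is read-only, so only the nearest
// fragment per pixel runs the expensive shader.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
glDepthFunc(GL_LEQUAL);        // accept fragments matching the pre-pass depth
DrawSceneWithFullShader();     // hypothetical helper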
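A rough sketch of that setup (OpenGL 3.1+); voxelPositions, voxelCount and cubeIndexCount are hypothetical, and each voxel stores xyz plus one float of padding so GL_RGBA32F can be used:

GLuint posBuf, posTex;
glGenBuffers(1, &posBuf);
glBindBuffer(GL_TEXTURE_BUFFER, posBuf);
glBufferData(GL_TEXTURE_BUFFER, voxelCount * 4 * sizeof(float),
             voxelPositions, GL_STATIC_DRAW);

glGenTextures(1, &posTex);
glBindTexture(GL_TEXTURE_BUFFER, posTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, posBuf);  // expose the buffer as texels

// One call draws every voxel as an instance of the cube in the bound VAO:
glDrawElementsInstanced(GL_TRIANGLES, cubeIndexCount, GL_UNSIGNED_INT, 0, voxelCount);

// Matching vertex-shader fragment (GLSL 330):
//   uniform samplerBuffer voxelPositions;
//   vec3 offset = texelFetch(voxelPositions, gl_InstanceID).xyz;
//   gl_Position = viewProj * vec4(inPosition + offset, 1.0);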
It may also be possible to parallelize things between the GPU and CPU by combining all three steps in two threads: in the CPU thread, you check the octree's PVS and update a TBO for instancing in the next frame; meanwhile, the GPU thread renders the two passes using the TBO for instancing that was created by the CPU thread in the previous step. After that you swap TBOs. If the camera has not moved, you don't even have to redo the CPU calculations.
Another kind of tree you may be interested in is the so-called k-d tree, which is more general than the octree.
PS: sorry for my english, it's not the clearest....
There are third-party libraries you could use to make the rendering more efficient. For example, the C++ PolyVox library can take a volume and generate the mesh for you efficiently. It has built-in methods for reducing triangle count and helping to generate things like ambient occlusion. It has a good community around it, so getting support on the forum should be easy.
Have you used a common display list for all your cubes?
Do you skip the drawing code for cubes that are not visible to the user?