How lighting in building games with unlimited number of lights works? - opengl

In Minecraft for example, you can place torches anywhere and each one effects the light level in the world and there is no limit to the amount of torches / light sources you can put down in the world. I am 99% sure that the lighting for the torches is taken care of on the CPU and stored for each block and so when rendering the light value at that certain block just needs to be passed into the shader, but light sources cannot move for this reason. If you had a game where you could place light sources that could move around (arrow on fire, minecart with a light on it, glowing ball of energy) and the lighting wasn't as simple (color was included) what are the most efficient ways to calculate the lighting effects.
From my research I have found differed rendering, differed lighting, dynamically creating shaders with different amounts of lights available and using a for loop (can't use uniforms due to unrolling), and static light maps (these would probably only be used for the still lights). Are there any other ways to do lighting calculations such as doing what minecraft does except allowing moving lights, or is it possible to take an infinite amount of lights and mathematically combine them into an approximation that only involves a few lights (this is an idea I came up with but I can't figure out how it could be done)?
If it helps, I am a programmer with decent experience in OpenGL (legacy and modern) so you can give me code snippets although I have not done too much with lighting so brief explanations would be appreciated. I am also willing to do research if you can point me in the right direction!

Your title is a bit misleading infinite light implies directional light in infinite distance like Sun. I would use unlimited number of lights instead. Here some approaches for this I know of:
(back) ray-tracers
they can handle any number of light sources natively. Light is just another object in engine. If ray hits the light source it just take the light intensity and stop the recursion. Unfortunately current gfx hardware is not suited for this kind of rendering. There are GPU enhanced engines for this but the specialized gfx HW is still in development and did not hit the market yet. Memory requirements are not much different then standard BR rendering and You can still use BR meshes but mathematical (analytical) meshes are natively supported and are better for this.
Standard BR rendering
BR means boundary representation such engines (Like OpenGL fixed function) can handle only limited number of lights. This is because each primitive/fragment needs the complete list of lights and the computations are done for all light on per primitive or per fragment basis. If you got many light this would be slow.
GLSL example of fixed number of light sources see the fragment shader
Also the current GPU's have limited memory for uniforms (registers) in which the lights and other rendering parameters are stored so there are possible workarounds like have light parameters stored in a texture and iterate over all of them per primitive/fragment inside GLSL shader but the number of lights affect performance of coarse so you are limited by target frame-rate and computational power. Additional memory requirements for this is just the texture with light parameters which is not so much (few vectors per light).
light maps
they can be computed even for moving objects. Complex light maps can be computed slowly (not per frame). This leads to small lighting artifacts but you need to know what to look for to spot it. Light maps and shadow maps are very similar and often computed at once. There are simple light maps and complex radiation maps models out there
look Shading mask algorithm for radiation calculations
These are either:
projected 2D maps (hard to implement/use and often less precise)
3D Voxel maps (Memory demanding but easier to compute/use)
Some approaches uses pre-rendered Z-Buffer as geometry source and then fill the lights via Radiosity or any other technique. These can handle any number of lights as these maps can be computation demanding they are often computed in the background and updated once in a while.
fast moving light sources are usually updated more often or excluded from maps and rendered as transparent geometry to make impression of light. The computational power needed for this depends on the computation method the basic are done like:
set a camera to the larges visible surfaces
render scene and handle the result as light/shadow map
store it as 2D or 3D texture or voxel map
and then continue with normal rendering from camera view
So you need to render scene more then once per frame/map update and also need additional buffers to store the rendered result which for high resolution or Voxel maps can be a big chunk of memory.
multi pass light layer
there are cases when light is added after rendering of the scene for example I used it for
Atmospheric scattering in GLSL
Here comes all multi pass rendering techniques you need additional buffers to store the sub results and usually the multi pass rendering is done on the same view/scene so pre-rendered geometry is used which significantly speeds this up either as locked VAO or as already rendered Z-buffer Color and Index buffers from first pass. After this handle next passes as single or few Quads (like in the Atmospheric scattering link) so the computational power needed for this is not much bigger in comparison to basic BR rendering
forward rendering vs. deferred-rendering
in a google this forward rendering vs. deferred-rendering is first relevant hit I found. It is not very good one (a bit to vague for my taste) but for starters it is enough
forward rendering techniques are usually standard single pass BR renders
deffered rendering is standard multi pass renders. In first pass is rendered all the geometries of the scene into Z buffer, Color buffer and some auxiliary buffers just to know which fragment of the result belongs to which object,material,... And then in the next passes are added effects,light,shadows,... but the geometry is not rendered again instead just single or few overlay QUADs/per pass are rendered so the next passes are usually pretty fast ...
The link suggest that for high lights number is the deffered rendering more suited but that strongly depends on which of the previous technique is used. Usually the multi pass light layer is used (with is one of the standard deffered rendering techniques) so in that case it is true, and the memory and computational power demands are the same see the previous section.

Related

Efficiently providing geometry for terrain physics

I have been researching different approaches to terrain systems in game engines for a bit now, trying to familiarize myself with the work. A number of the details seem straightforward, but I am getting hung up on a single detail.
For performance reasons many terrain solutions utilize shaders to generate parts or all of the geometry, such as vertex shaders to generate positions or tessellation shaders for LoD. At first I figured those approaches were exclusively for renders that weren't concerned about physics simulations.
The reason I say that is because as I understand shaders at the moment, the results of a shader computation generally are discarded at the end of the frame. So if you rely on shaders heavily then the geometry information will be gone before you could access it and send it off to another system (such as physics running on the CPU).
So, am I wrong about shaders? Can you store the results of them generating geometry to be accessed by other systems? Or am I forced to keep the terrain geometry on CPU and leave the shaders to the other details?
Shaders
You understand parts of the shaders correctly, that is: after a frame, the data is stored as a final composed image in the backbuffer.
BUT: Using transform feedback it is possible to capture transformed geometry into a vertex buffer and reuse it. Transform Feedback happens AFTER the vertex/geometry/tessellation shader, so you could use the geometry shader to generate a terrain (or visible parts of it once), push it through transform-feedback and store it.
This way, you potentially could use CPU collision detection with your terrain! You can even combine this with tessellation.
You will love this: A Framework for Real-Time, Deformable Terrain.
For the LOD and tessellation: LOD is not the prerequisite of tessellation. You can use tessellation to allow some more sophisticated effects such as adding a detail by recursive subdivision of rough geometry. Linking it with LOD is simply a very good optimization avoiding RAM-memory based LOD-mesh-levels, since you just have your "base mesh" and subdivide it (Although this will be an unsatisfying optimization imho).
Now some deeper info on GPU and CPU exclusive terrain.
GPU Generated Terrain (Procedural)
As written in the NVidia article Generating Complex Procedural Terrains Using the GPU:
1.2 Marching Cubes and the Density Function Conceptually, the terrain surface can be completely described by a single function, called the
density function. For any point in 3D space (x, y, z), the function
produces a single floating-point value. These values vary over
space—sometimes positive, sometimes negative. If the value is
positive, then that point in space is inside the solid terrain.
If the value is negative, then that point is located in empty space
(such as air or water). The boundary between positive and negative
values—where the density value is zero—is the surface of the terrain.
It is along this surface that we wish to construct a polygonal mesh.
Using Shaders
The density function used for generating the terrain, must be available for the collision-detection shader and you have to fill an output buffer containing the collision locations, if any...
CUDA
See: https://www.youtube.com/watch?v=kYzxf3ugcg0
Here someone used CUDA, based on the NVidia article, which however implies the same:
In CUDA, performing collision detection, the density function must be shared.
This will however make the transform feedback techniques a little harder to implement.
Both, Shaders and CUDA, imply resampling/recalculation of the density at at least one location, just for the collision detection of a single object.
CPU Terrain
Usually, this implies a RAM-memory stored set of geometry in the form of vertex/index-buffer pairs, which are regularly processed by the shader-pipeline. As you have the data available here, you will also most likely have a collision mesh, which is a simplified representation of your terrain, against which you perform collision.
Alternatively you could spend your terrain a set of colliders, marking the allowed paths, which is imho performed in the early PS1 Final Fantasy games (which actually don't really have a terrain in the sense we understand terrain today).
This short answer is neither extensively deep nor complete. I just tried to give you some insight into some concepts used in dozens of solutions.
Some more reading: http://prideout.net/blog/?tag=opengl-transform-feedback.

Tiled deferred shading without compute shader

I'm building a deferred renderer and since I want to support a large amount of lights in the scene I've had a look at tiled deferred shading.
The problem is that I have to target OpenGL 3.3 hardware and it doesn't support GLSL compute shaders.
Is there a possibility to implement tiled deferred shading with normal shaders?
Tiled deferred rendering does not strictly require a compute shader. What it requires is that, for each tile, you have a series of lights which it will process. A compute shader is merely one way to accomplish that.
An alternative is to build the light lists for each frustum on the CPU, then upload that data to the GPU for its eventual use. Obviously it requires much more memory work than the CS version. But it's probably not that expensive, and it allows you to easily play with tile sizes to find the most optimal. More tiles means more CPU work and more data to be uploaded, but fewer lights-per-tile (generally speaking) and more efficient processing.
One way to do that for GL 3.3-class hardware is to make each tile a separate quad. The quad will be given, as part of its per-vertex parameters, the starting index for its part of the total light list and the total number of lights for that tile to process. The idea being that there is a globally-accessible array, and each tile has a contiguous region of this array that it will process.
This array could be the actual lights themselves, or it could be indices into a second (much smaller) array of lights. You'll have to measure the difference to tell if it's worthwhile to have the additional indirection in the access.
The primary array should probably be a buffer texture, since it can get quite large, depending on the number of lights and tiles. If you go with the indirect route, then the array of actual light data will likely fit into a uniform block. But in either case, you're going to need to employ buffer streaming techniques when uploading to it.

Deferred Rendering with Tile-Based culling Concept Problems

EDIT: I'm still looking for some help about the use of OpenCL or compute shaders. I would prefer to keep using OGL 3.3 and not have to deal with the bad driver support for OGL 4.3 and OpenCL 1.2, but I can't think of anyway to do this type of shading without using one of the two (to match lights and tiles). Is it possible to implement tile-based culling without using GPGPU?
I wrote a deferred render in OpenGL 3.3. Right now I don't do any culling for the light pass (I just render a full screen quad for every light). This (obviously) has a ton of overdraw. (Sometimes it is ~100%). Because of this I've been looking into ways to improve performance during the light pass. It seems like the best way in (almost) everyone's opinion is to cull the scene using screen space tiles. This was the method used in Frostbite 2. I read the the presentation from Andrew Lauritzen during SIGGRAPH 2010 (http://download-software.intel.com/sites/default/files/m/d/4/1/d/8/lauritzen_deferred_shading_siggraph_2010.pdf) , and I'm not sure I fully understand the concept. (and for that matter why it's better than anything else, and if it is better for me)
In the presentation Laurtizen goes over deferred shading with light volumes, quads, and tiles for culling the scene. According to his data, the tile based deferred renderer was the fastest (by far). I don't understand why it is though. I'm guessing it has something to do with the fact that for each tile, all the lights are batched together. In the presentation it says to read the G-Buffer once and then compute the lighting, but this doesn't make sense to me. In my mind, I would implement this like this:
for each tile {
for each light effecting the tile {
render quad (the tile) and compute lighting
blend with previous tiles (GL_ONE, GL_ONE)
}
}
This would still involve sampling the G-Buffer a lot. I would think that doing that would have the same (if not worse) performance than rendering a screen aligned quad for every light. From how it's worded though, it seems like this is what's happening:
for each tile {
render quad (the tile) and compute all lights
}
But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs . Can anyone help me with this? It also seems like almost every tile based deferred renderer uses compute shaders or OpenCL (to batch the lights), why is this, and if I didn't use these what would happen?
But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs .
It rather depends on how many lights you have. The "instruction limits" are pretty high; it's generally not something you need to worry about outside of degenerate cases. Even if 100+ lights affects a tile, odds are fairly good that your lighting computations aren't going to exceed instruction limits.
Modern GL 3.3 hardware can run at least 65536 dynamic instructions in a fragment shader, and likely more. For 100 lights, that's still 655 instructions per light. Even if you take 2000 instructions to compute the camera-space position, that still leaves 635 instructions per light. Even if you were doing Cook-Torrance directly in the GPU, that's probably still sufficient.

Is it possible to reuse glsl vertex shader output later?

I have a huge mesh(100k triangles) that needs to be drawn a few times and blend together every frame. Is it possible to reuse the vertex shader output of the first pass of mesh, and skip the vertex stage on later passes? I am hoping to save some cost on the vertex pipeline and rasterization.
Targeted OpenGL 3.0, can use features like transform feedback.
I'll answer your basic question first, then answer your real question.
Yes, you can store the output of vertex transformation for later use. This is called Transform Feedback. It requires OpenGL 3.x-class hardware or better (aka: DX10-hardware).
The way it works is in two stages. First, you have to set your program up to have feedback-based varyings. You do this with glTransformFeedbackVaryings. This must be done before linking the program, in a similar way to things like glBindAttribLocation.
Once that's done, you need to bind buffers (given how you set up your transform feedback varyings) to GL_TRANSFORM_FEEDBACK_BUFFER with glBindBufferRange, thus setting up which buffers the data are written into. Then you start your feedback operation with glBeginTransformFeedback and proceed as normal. You can use a primitive query object to get the number of primitives written (so that you can draw it later with glDrawArrays), or if you have 4.x-class hardware (or AMD 3.x hardware, all of which supports ARB_transform_feedback2), you can render without querying the number of primitives. That would save time.
Now for your actual question: it's probably not going to help buy you any real performance.
You're drawing terrain. And terrain doesn't really get any transformation. Typically you have a matrix multiplication or two, possibly with normals (though if you're rendering for shadow maps, you don't even have that). That's it.
Odds are very good that if you shove 100,000 vertices down the GPU with such a simple shader, you've probably saturated the GPU's ability to render them all. You'll likely bottleneck on primitive assembly/setup, and that's not getting any faster.
So you're probably not going to get much out of this. Feedback is generally used for either generating triangle data for later use (effectively pseudo-compute shaders), or for preserving the results from complex transformations like matrix palette skinning with dual-quaternions and so forth. A simple matrix multiply-and-go will barely be a blip on the radar.
You can try it if you like. But odds are you won't have any problems. Generally, the best solution is to employ some form of deferred rendering, so that you only have to render an object once + X for every shadow it casts (where X is determined by the shadow mapping algorithm). And since shadow maps require different transforms, you wouldn't gain anything from feedback anyway.

OpenGL voxel engine slow

I'm making a voxel engine in C++ and OpenGL (à la Minecraft) and can't get decent fps on my 3GHz with ATI X1600... I'm all out of ideas.
When I have about 12000 cubes on the screen it falls to under 20fps - pathetic.
So far the optimizations I have are: frustum culling, back face culling (via OpenGL's glEnable(GL_CULL_FACE)), the engine draws only the visible faces (except the culled ones of course) and they're in an octree.
I've tried VBO's, I don't like them and they do not significantly increase the fps.
How can Minecraft's engine be so fast... I struggle with a 10000 cubes, whereas Minecraft can easily draw much more at higher fps.
Any ideas?
#genpfault: I analyze the connectivity and just generate faces for the outer, visible surface. The VBO had a single cube that I glTranslate()d
I'm not an expert at OpenGL, but as far as I understand this is going to save very little time because you still have to send every cube to the card.
Instead what you should do is generate faces for all of the outer visible surface, put that in a VBO, and send it to the card and continue to render that VBO until the geometry changes. This saves you a lot of the time your card is actually waiting on your processor to send it the geometry information.
You should profile your code to find out if the bottleneck in your application is on the CPU or GPU. For instance it might be that your culling/octtree algorithms are slow and in that case it is not an OpenGL-problem at all.
I would also keep count of the number of cubes you draw on each frame and display that on screen. Just so you know your culling routines work as expected.
Finally you don't mention if your cubes are textured. Try using smaller textures or disable textures and see how much the framerate increases.
gDEBugger is a great tool that will help you find bottlenecks with OpenGL.
I don't know if it's ok here to "bump" an old question but a few things came up my mind:
If your voxels are static you can speed up the whole rendering process by using an octree for frustum culling, etc. Furthermore you can also compile a static scene into a potential-visibility-set in the octree. The main principle of PVS is to precompute for evere node in the tree which other nodes are potential visible from it and store pointers to them in a vector. When it comes to rendering you first check in which node the camera is placed and then run frustum culling against all nodes in the PVS-vector of the node.(Carmack used something like that in the Quake engines, but with Binary Space Partitioning trees)
If the shading of your voxels is kindalike complex it is also fast to do a pre-Depth-Only-Pass, without writing into the colorbuffer,just to fill the Depthbuffer. After that you render a 2nd pass: disable writing to the Depthbuffer and render only to the Colorbuffer while checking the Depthbuffer. So you avoid expensive shader-computations which are later overwritten by a new fragment which is closer to the viewer.(Carmack used that in Quake3)
Another thing which will definitely speed up things is the use of Instancing. You store only the position of each voxel and, if nescessary, its scale and other parameters into a texturebufferobject. In the vertexshader you can then read the positions of the voxels to be spawned and create an instance of the voxel(i.e. a cube which is given to the shader in a vertexbufferobject). So you send the 8 Vertices + 8 Normals (3 *sizeof(float) *8 +3 *sizeof(float) *8 + floats for color/texture etc...) only once to the card in the VBO and then only the positions of the instances of the Cube(3*sizeof(float)*number of voxels) in the TBO.
Maybe it is possibile to parallelize things between GPU and CPU by combining all 3 steps in 2 threads, in the CPU-thread you check the octrees pvs and update a TBO for instancing in the next frame, the GPU-thread does meanwhile render the 2 passes while using an TBO for instancing which was created by the CPU thread in the previous step. After that you switch TBOs. If the Camera has not moved you don't even have to do the CPU-calculations again.
Another kind of tree you me be interested in is the so called k-d-tree, which is more general than octrees.
PS: sorry for my english, it's not the clearest....
There are 3rd-party libraries you could use to make the rendering more efficient. For example the C++ PolyVox library can take a volume and generate the mesh for you in an efficient way. It has built-in methods for reducing triangle count and helping to generate things like ambient occlusion. It's got a good community around it so getting support on the forum should be easy.
Have you used a common display list for all your cubes ?
Do you skip calling drawing code of cubes which are not visible to the user ?