Is there an efficient way to exceed GL_MAX_VIEWPORTS? - opengl

I am currently implementing the pose estimation algorithm proposed in Oikonomidis et al., 2011, which involves rendering a mesh in N different hypothesised poses (N will probably be about 64). Section 2.5 suggests speeding up the computation by using instancing to generate multiple renderings simultaneously (after which they reduce each rendering to a single number on the GPU), and from their description, it sounds like they found a way to produce N renderings simultaneously.
In my implementation's setup phase, I use an OpenGL viewport array to define GL_MAX_VIEWPORTS viewports. Then in the rendering phase, I transfer an array of GL_MAX_VIEWPORTS model-pose matrices to a mat4 uniform array in GPU memory (I am only interested in estimating position and orientation), and use gl_InvocationID in my geometry shader to select the appropriate pose matrix and viewport for each polygon of the mesh.
GL_MAX_VIEWPORTS is 16 on my machine (I have a GeForce GTX Titan), so this method will allow me to render up to 16 hypotheses at a time on the GPU. This may turn out to be fast enough, but I am nonetheless curious about the following:
Is there a workaround for the GL_MAX_VIEWPORTS limitation that is likely to be faster than calling my render function ceil(double(N)/GL_MAX_VIEWPORTS) times?
I only started learning the shader-based approach to OpenGL a couple of weeks ago, so I don't yet know all the tricks. I initially thought of replacing my use of the built-in viewport support with a combination of:
a geometry shader that adds h*gl_InvocationID to the y coordinates of the vertices after perspective projection (where h is the desired viewport height) and passes gl_InvocationID onto the fragment shader; and
a fragment shader that discards fragments whose y coordinates satisfy y < gl_InvocationID*h || y >= (gl_InvocationID+1)*h.
But I was put off investigating this idea further by the fear that branching and discard would be very detrimental to performance.
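For reference, here is roughly what that discard-based workaround could look like in GLSL. This is an untested sketch: the invocation count, the array size, and the names `pose`, `band_id`, and `h` are invented for illustration, and `pose` is assumed to already include the projection.

```glsl
// Geometry shader (sketch): route each invocation's triangles into its own
// horizontal band of the framebuffer.
#version 400
layout(triangles, invocations = 16) in;
layout(triangle_strip, max_vertices = 3) out;

uniform mat4 pose[16];          // projection * model-pose, one per hypothesis

flat out int band_id;

void main() {
    for (int i = 0; i < 3; ++i) {
        vec4 p = pose[gl_InvocationID] * gl_in[i].gl_Position;
        // Squash the full NDC range into 1/16th of the framebuffer height
        // and shift it into band gl_InvocationID (scaled by w so the
        // offset survives the perspective divide).
        float band_centre = -1.0 + (2.0 * float(gl_InvocationID) + 1.0) / 16.0;
        p.y = p.y / 16.0 + band_centre * p.w;
        gl_Position = p;
        band_id = gl_InvocationID;
        EmitVertex();
    }
    EndPrimitive();
}

// Fragment shader (sketch): discard anything that spills into a
// neighbouring band; h is the band height in window pixels.
#version 400
flat in int band_id;
uniform float h;

void main() {
    float y = gl_FragCoord.y;
    if (y < float(band_id) * h || y >= float(band_id + 1) * h)
        discard;
    // ... normal shading / depth output ...
}
```

Whether the per-fragment discard actually beats simply looping the render call is exactly the measurement the question defers.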
The authors of the paper above released a technical report describing some of their GPU acceleration methods, but it's not detailed enough to answer my question. Section 3.2.3 says "During geometry instancing, viewport information is attached to every vertex... A custom pixel shader clips pixels that are outside their pre-defined viewports". This sounds similar to the workaround that I've described above, but they were using Direct3D, so it's not easy to compare what they were able to achieve with that in 2011 to what I can achieve today in OpenGL.
I realise that the only definitive answer to my question is to implement the workaround and measure its performance, but it's currently a low-priority curiosity, and I haven't found answers anywhere else, so I hoped that a more experienced GLSL user might be able to offer their time-saving wisdom.

From a cursory glance at the paper, it seems to me that the actual viewport doesn't change. That is, you're still rendering to the same width/height and X/Y positions, with the same depth range.
What you want is to change which image you're rendering to. Which is what gl_Layer is for; to change which layer within the layered array of images attached to the framebuffer you are rendering to.
So just set gl_ViewportIndex to 0 for all vertices. Or, more specifically, don't set it at all.
The number of GS instancing invocations does not have to be a restriction; that's your choice. GS invocations can write multiple primitives, each to a different layer. So you could have each instance write, for example, 4 primitives, each to 4 separate layers.
Your only limitations should be the number of layers you can use (governed by GL_MAX_ARRAY_TEXTURE_LAYERS and GL_MAX_FRAMEBUFFER_LAYERS, both of which must be at least 2048), and the number of primitives and vertex data that a single GS invocation can emit (which is kind of complicated).
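As a sketch of this layered approach (untested; the invocation and layer counts and the uniform name `pose` are illustrative, not from the paper), each geometry shader invocation can write the triangle to several consecutive layers:

```glsl
// Geometry shader (sketch): 16 invocations x 4 layers = 64 renderings
// per draw call.
#version 400
layout(triangles, invocations = 16) in;
layout(triangle_strip, max_vertices = 12) out;

uniform mat4 pose[64];          // one model-pose matrix per hypothesis

void main() {
    for (int j = 0; j < 4; ++j) {
        int hyp = gl_InvocationID * 4 + j;
        for (int i = 0; i < 3; ++i) {
            gl_Layer = hyp;     // target layer of the layered framebuffer
            gl_Position = pose[hyp] * gl_in[i].gl_Position;
            EmitVertex();
        }
        EndPrimitive();
    }
}
```

The framebuffer attachment must be layered for gl_Layer to have any effect, e.g. glFramebufferTexture with a 2D array texture whose layer count covers all hypotheses.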

Related

Can GLSL handle buffers with arbitrary length?

I have an art application I'm dabbling with that uses OpenGL for accelerated graphics rendering. I'd like to be able to add the ability to draw arbitrary piecewise curves - pretty much the same sort of shapes that can be defined by the SVG 'path' element.
Rather than tessellating my paths into polygons on the CPU, I thought it might be better to pass an array of values in a buffer to my shader defining the pieces of my curve and then using an in/out test to check which pixels were actually inside. In other words, I'd be iterating through a potentially large array of data describing each segment in my path.
From what I remember back when I learned shader programming years ago, GPUs handle if statements by evaluating both branches and then throwing away the branch that wasn't used. This would effectively mean that it would end up silently running through my entire buffer even if I only used a small part of it (i.e., my buffer has the capacity to handle 1024 curve segments, but the simple rectangle I'm drawing only uses the first four of them).
How do I write my code to deal with this variable data? Can modern GPUs handle conditional code like this well?
GPUs can handle arbitrary-length buffers and conditionals (or fake it convincingly). The problem is that vertex and geometry shaders cannot generate an arbitrary number of triangles from a short description.
OpenGL 4.0 added two new types of shaders: Tessellation Control shaders and Tessellation Evaluation shaders. These shaders give you the ability to tessellate curves and surfaces on the GPU.
I found this tutorial to be quite useful in showing how to tessellate Bezier curves on the GPU.
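The evaluation a Tessellation Evaluation shader performs for a cubic Bezier segment is just the cubic Bernstein polynomial, with gl_TessCoord.x as the curve parameter. As a standalone illustration of that arithmetic (the struct and function names are mine, not from the tutorial):

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;

/* Evaluate a cubic Bezier at parameter t in [0,1] using the Bernstein
 * basis: B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3. */
Vec2 bezier3(Vec2 p0, Vec2 p1, Vec2 p2, Vec2 p3, float t) {
    float u  = 1.0f - t;
    float b0 = u * u * u;
    float b1 = 3.0f * u * u * t;
    float b2 = 3.0f * u * t * t;
    float b3 = t * t * t;
    Vec2 r = { b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
               b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y };
    return r;
}
```

A TES would run this once per vertex generated by the tessellator, with the control points arriving as the patch.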

Efficiently providing geometry for terrain physics

I have been researching different approaches to terrain systems in game engines for a bit now, trying to familiarize myself with the work. A number of the details seem straightforward, but I am getting hung up on a single detail.
For performance reasons many terrain solutions utilize shaders to generate parts or all of the geometry, such as vertex shaders to generate positions or tessellation shaders for LoD. At first I figured those approaches were exclusively for renders that weren't concerned about physics simulations.
The reason I say that is because as I understand shaders at the moment, the results of a shader computation generally are discarded at the end of the frame. So if you rely on shaders heavily then the geometry information will be gone before you could access it and send it off to another system (such as physics running on the CPU).
So, am I wrong about shaders? Can you store the results of them generating geometry to be accessed by other systems? Or am I forced to keep the terrain geometry on CPU and leave the shaders to the other details?
Shaders
You understand shaders partly correctly: by default, after a frame the data survives only as the final composed image in the backbuffer.
BUT: using transform feedback it is possible to capture transformed geometry into a vertex buffer and reuse it. Transform feedback happens AFTER the vertex/geometry/tessellation shaders, so you could use the geometry shader to generate a terrain (or the visible parts of it, once), push it through transform feedback, and store it.
This way, you could potentially use CPU collision detection with your terrain! You can even combine this with tessellation.
You will love this: A Framework for Real-Time, Deformable Terrain.
For the LOD and tessellation: LOD is not a prerequisite of tessellation. You can use tessellation for more sophisticated effects, such as adding detail by recursive subdivision of rough geometry. Linking it with LOD is simply a very good optimization that avoids RAM-based LOD mesh levels, since you just keep your "base mesh" and subdivide it (although this will be an unsatisfying optimization, imho).
Now for some deeper info on GPU- and CPU-exclusive terrain.
GPU Generated Terrain (Procedural)
As written in the NVidia article Generating Complex Procedural Terrains Using the GPU:
1.2 Marching Cubes and the Density Function
Conceptually, the terrain surface can be completely described by a single function, called the density function. For any point in 3D space (x, y, z), the function produces a single floating-point value. These values vary over space—sometimes positive, sometimes negative. If the value is positive, then that point in space is inside the solid terrain. If the value is negative, then that point is located in empty space (such as air or water). The boundary between positive and negative values—where the density value is zero—is the surface of the terrain. It is along this surface that we wish to construct a polygonal mesh.
Using Shaders
The density function used for generating the terrain must also be available to the collision-detection shader, and you have to fill an output buffer containing the collision locations, if any...
CUDA
See: https://www.youtube.com/watch?v=kYzxf3ugcg0
Here someone used CUDA, based on the NVidia article, and the same requirement applies:
to perform collision detection in CUDA, the density function must be shared.
This will, however, make the transform feedback techniques a little harder to implement.
Both shaders and CUDA imply resampling/recalculating the density at least at one location, just for the collision detection of a single object.
CPU Terrain
Usually this implies a set of geometry stored in RAM as vertex/index-buffer pairs, which are regularly processed by the shader pipeline. Since you have the data available here, you will also most likely have a collision mesh, which is a simplified representation of your terrain, against which you perform collision.
Alternatively, you could give your terrain a set of colliders marking the allowed paths, which is, imho, how the early PS1 Final Fantasy games did it (they don't really have terrain in the sense we understand terrain today).
This short answer is neither extensively deep nor complete. I just tried to give you some insight into some concepts used in dozens of solutions.
Some more reading: http://prideout.net/blog/?tag=opengl-transform-feedback.

How does lighting work in building games with an unlimited number of lights?

In Minecraft, for example, you can place torches anywhere; each one affects the light level in the world, and there is no limit to the number of torches/light sources you can put down. I am 99% sure that the lighting for the torches is computed on the CPU and stored per block, so when rendering, only the light value at a given block needs to be passed into the shader, but light sources cannot move for this reason. If you had a game where you could place light sources that could move around (an arrow on fire, a minecart with a light on it, a glowing ball of energy) and the lighting wasn't as simple (color was included), what are the most efficient ways to calculate the lighting effects?
From my research I have found deferred rendering, deferred lighting, dynamically creating shaders with different numbers of lights available and using a for loop (can't use uniforms due to unrolling), and static light maps (these would probably only be used for the still lights). Are there any other ways to do lighting calculations, such as doing what Minecraft does except allowing moving lights? Or is it possible to take an infinite number of lights and mathematically combine them into an approximation that only involves a few lights (this is an idea I came up with, but I can't figure out how it could be done)?
If it helps, I am a programmer with decent experience in OpenGL (legacy and modern) so you can give me code snippets although I have not done too much with lighting so brief explanations would be appreciated. I am also willing to do research if you can point me in the right direction!
Your title is a bit misleading: "infinite light" implies a directional light at infinite distance, like the Sun; "unlimited number of lights" would be more accurate. Here are some approaches for this that I know of:
(back) ray-tracers
they can handle any number of light sources natively; light is just another object in the engine. If a ray hits a light source, it just takes the light intensity and stops the recursion. Unfortunately, current gfx hardware is not suited to this kind of rendering. There are GPU-enhanced engines for this, but specialized gfx HW is still in development and has not hit the market yet. Memory requirements are not much different than standard BR rendering, and you can still use BR meshes, but mathematical (analytical) meshes are natively supported and better suited for this.
Standard BR rendering
BR means boundary representation. Such engines (like the OpenGL fixed-function pipeline) can handle only a limited number of lights, because each primitive/fragment needs the complete list of lights and the computations are done for all lights on a per-primitive or per-fragment basis. With many lights this would be slow.
GLSL example of fixed number of light sources see the fragment shader
Also, current GPUs have limited memory for uniforms (registers), in which the lights and other rendering parameters are stored, so there are workarounds: for example, store the light parameters in a texture and iterate over all of them per primitive/fragment inside the GLSL shader. The number of lights still affects performance, of course, so you are limited by your target frame rate and computational power. The additional memory requirement is just the texture with the light parameters, which is not much (a few vectors per light).
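A sketch of that texture-driven variant (untested; the texel layout, falloff model, and all names here are invented): light parameters packed two texels per light into a floating-point 1D texture, iterated per fragment.

```glsl
// Fragment shader loop over lights stored in a texture.
// Texel layout (illustrative): texel 2*i   = position.xyz + radius,
//                              texel 2*i+1 = color.rgb    + unused.
#version 330
uniform sampler1D light_tex;
uniform int light_count;        // actual number of lights this frame

in vec3 frag_pos;
in vec3 frag_normal;
out vec4 out_color;

void main() {
    vec3 n   = normalize(frag_normal);
    vec3 sum = vec3(0.0);
    for (int i = 0; i < light_count; ++i) {
        vec4 pos_r = texelFetch(light_tex, 2 * i,     0);
        vec4 col   = texelFetch(light_tex, 2 * i + 1, 0);
        vec3 to_l  = pos_r.xyz - frag_pos;
        float att  = max(0.0, 1.0 - length(to_l) / pos_r.w); // linear falloff
        sum += col.rgb * att * max(0.0, dot(normalize(to_l), n));
    }
    out_color = vec4(sum, 1.0);
}
```

Updating a handful of texels per frame is cheap, which is what makes this workable for moving lights.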
light maps
they can be computed even for moving objects. Complex light maps can be computed slowly (not per frame); this leads to small lighting artifacts, but you need to know what to look for to spot them. Light maps and shadow maps are very similar and are often computed at once. There are both simple light maps and complex radiation-map models out there;
see Shading mask algorithm for radiation calculations
These are either:
projected 2D maps (hard to implement/use and often less precise)
3D Voxel maps (Memory demanding but easier to compute/use)
Some approaches use a pre-rendered Z-buffer as the geometry source and then fill in the lights via radiosity or another technique. These can handle any number of lights. As such maps can be computationally demanding, they are often computed in the background and updated once in a while.
Fast-moving light sources are usually updated more often, or excluded from the maps and rendered as transparent geometry to give the impression of light. The computational power needed depends on the computation method; the basics are done like this:
set a camera at the largest visible surfaces
render the scene and handle the result as a light/shadow map
store it as a 2D or 3D texture or voxel map
and then continue with normal rendering from the camera view
So you need to render the scene more than once per frame/map update, and you also need additional buffers to store the rendered result, which for high-resolution or voxel maps can be a big chunk of memory.
multi pass light layer
there are cases when light is added after the scene is rendered; for example, I used it for
Atmospheric scattering in GLSL
This covers all multi-pass rendering techniques. You need additional buffers to store the intermediate results, and usually the multi-pass rendering is done on the same view/scene, so pre-rendered geometry is used, which significantly speeds this up: either a locked VAO, or the already-rendered Z, color, and index buffers from the first pass. The later passes are then handled as a single quad or a few quads (as in the Atmospheric scattering link), so the computational power needed is not much greater than for basic BR rendering.
forward rendering vs. deferred-rendering
The first relevant google hit I found is forward rendering vs. deferred-rendering. It is not a very good one (a bit too vague for my taste), but for starters it is enough.
forward rendering techniques are usually standard single-pass BR renderers
deferred rendering means standard multi-pass rendering. In the first pass, all the geometry of the scene is rendered into the Z-buffer, the color buffer, and some auxiliary buffers, just to record which fragment of the result belongs to which object, material, etc. Then, in the later passes, effects, lights, shadows, and so on are added, but the geometry is not rendered again; instead, just a single quad or a few overlay quads are rendered per pass, so the later passes are usually pretty fast.
The link suggests that deferred rendering is better suited to high light counts, but that strongly depends on which of the previous techniques is used. Usually the multi-pass light layer is used (which is one of the standard deferred rendering techniques), so in that case it is true, and the memory and computational power demands are the same; see the previous section.

Deferred Rendering with Tile-Based culling Concept Problems

EDIT: I'm still looking for some help on the use of OpenCL or compute shaders. I would prefer to keep using OGL 3.3 and not have to deal with the bad driver support for OGL 4.3 and OpenCL 1.2, but I can't think of any way to do this type of shading without using one of the two (to match lights and tiles). Is it possible to implement tile-based culling without using GPGPU?
I wrote a deferred renderer in OpenGL 3.3. Right now I don't do any culling for the light pass (I just render a full-screen quad for every light). This (obviously) has a ton of overdraw. (Sometimes it is ~100%.) Because of this I've been looking into ways to improve performance during the light pass. It seems like the best way in (almost) everyone's opinion is to cull the scene using screen-space tiles. This was the method used in Frostbite 2. I read the presentation from Andrew Lauritzen during SIGGRAPH 2010 (http://download-software.intel.com/sites/default/files/m/d/4/1/d/8/lauritzen_deferred_shading_siggraph_2010.pdf), and I'm not sure I fully understand the concept (and, for that matter, why it's better than anything else, and whether it is better for me).
In the presentation Lauritzen goes over deferred shading with light volumes, quads, and tiles for culling the scene. According to his data, the tile-based deferred renderer was the fastest (by far). I don't understand why it is, though. I'm guessing it has something to do with the fact that for each tile, all the lights are batched together. In the presentation it says to read the G-Buffer once and then compute the lighting, but this doesn't make sense to me. In my mind, I would implement it like this:
for each tile {
    for each light affecting the tile {
        render quad (the tile) and compute lighting
        blend with previous tiles (GL_ONE, GL_ONE)
    }
}
This would still involve sampling the G-Buffer a lot. I would think that doing that would give the same (if not worse) performance as rendering a screen-aligned quad for every light. From how it's worded, though, it seems like this is what's happening:
for each tile {
    render quad (the tile) and compute all lights
}
But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs. Can anyone help me with this? It also seems like almost every tile-based deferred renderer uses compute shaders or OpenCL (to batch the lights); why is this, and what would happen if I didn't use them?
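For what it's worth, the batching step itself is plain screen-space arithmetic: for each light, find the range of tiles its screen-space extent overlaps, and append the light to those tiles' lists. A minimal CPU-side sketch (the types and the 16-pixel tile size are invented, and the light is assumed to intersect the screen):

```c
#include <assert.h>

#define TILE 16   /* tile size in pixels (illustrative) */

/* Hypothetical light already projected to screen space: centre + radius. */
typedef struct { float x, y, radius; } Light;

/* Conservative bounding-box test: how many TILE x TILE tiles does this
 * light's screen-space extent overlap? A compute-shader/OpenCL culling
 * pass does the same intersection test, typically one tile per work group. */
int tiles_touched(Light l, int width, int height) {
    int max_x = width  / TILE - 1;
    int max_y = height / TILE - 1;
    int x0 = (int)((l.x - l.radius) / TILE);
    int y0 = (int)((l.y - l.radius) / TILE);
    int x1 = (int)((l.x + l.radius) / TILE);
    int y1 = (int)((l.y + l.radius) / TILE);
    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 > max_x) x1 = max_x;
    if (y1 > max_y) y1 = max_y;
    return (x1 - x0 + 1) * (y1 - y0 + 1);
}
```

A small light then contributes shading work to only a handful of tiles instead of a full-screen quad, which is where the win over per-light quads comes from.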
But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs.
It rather depends on how many lights you have. The "instruction limits" are pretty high; it's generally not something you need to worry about outside of degenerate cases. Even if 100+ lights affect a tile, odds are fairly good that your lighting computations aren't going to exceed the instruction limits.
Modern GL 3.3 hardware can run at least 65536 dynamic instructions in a fragment shader, and likely more. For 100 lights, that's still 655 instructions per light. Even if you take 2000 instructions to compute the camera-space position, that still leaves 635 instructions per light. Even if you were doing Cook-Torrance directly in the GPU, that's probably still sufficient.
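The arithmetic above, made explicit (the 65536 and 2000 figures are the ones quoted in this answer, not universal constants):

```c
#include <assert.h>

/* Dynamic-instruction budget left per light after a fixed per-fragment
 * cost (e.g. reconstructing camera-space position) is paid once. */
int per_light_budget(int total_instructions, int fixed_cost, int num_lights) {
    return (total_instructions - fixed_cost) / num_lights;
}
```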

Is it possible to reuse glsl vertex shader output later?

I have a huge mesh (100k triangles) that needs to be drawn a few times and blended together every frame. Is it possible to reuse the vertex shader output from the first pass of the mesh, and skip the vertex stage on later passes? I am hoping to save some cost in the vertex pipeline and rasterization.
Targeted OpenGL 3.0, can use features like transform feedback.
I'll answer your basic question first, then answer your real question.
Yes, you can store the output of vertex transformation for later use. This is called Transform Feedback. It requires OpenGL 3.x-class hardware or better (aka: DX10-hardware).
The way it works is in two stages. First, you have to set your program up to have feedback-based varyings. You do this with glTransformFeedbackVaryings. This must be done before linking the program, in a similar way to things like glBindAttribLocation.
Once that's done, you need to bind buffers (given how you set up your transform feedback varyings) to GL_TRANSFORM_FEEDBACK_BUFFER with glBindBufferRange, thus setting up which buffers the data are written into. Then you start your feedback operation with glBeginTransformFeedback and proceed as normal. You can use a primitive query object to get the number of primitives written (so that you can draw it later with glDrawArrays), or if you have 4.x-class hardware (or AMD 3.x hardware, all of which supports ARB_transform_feedback2), you can render without querying the number of primitives. That would save time.
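Condensed into the call sequence described above (a fragment, not a runnable program: `prog`, `feedback_buf`, and `vert_count` are placeholders, and context creation, query objects, and error handling are omitted):

```c
/* 1. Declare which varyings to capture; requires a re-link. */
const char *varyings[] = { "gl_Position" };
glTransformFeedbackVaryings(prog, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(prog);

/* 2. Bind the destination buffer to the feedback binding point. */
glBindBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, feedback_buf,
                  0, vert_count * 4 * sizeof(float));

/* 3. Capture while drawing (rasterizer discard = capture only, no drawing). */
glEnable(GL_RASTERIZER_DISCARD);
glBeginTransformFeedback(GL_TRIANGLES);
glDrawArrays(GL_TRIANGLES, 0, vert_count);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
```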
Now for your actual question: it's probably not going to buy you any real performance.
You're drawing terrain. And terrain doesn't really get any transformation. Typically you have a matrix multiplication or two, possibly with normals (though if you're rendering for shadow maps, you don't even have that). That's it.
Odds are very good that if you shove 100,000 vertices down the GPU with such a simple shader, you've probably saturated the GPU's ability to render them all. You'll likely bottleneck on primitive assembly/setup, and that's not getting any faster.
So you're probably not going to get much out of this. Feedback is generally used for either generating triangle data for later use (effectively pseudo-compute shaders), or for preserving the results from complex transformations like matrix palette skinning with dual-quaternions and so forth. A simple matrix multiply-and-go will barely be a blip on the radar.
You can try it if you like. But odds are you won't have any problems. Generally, the best solution is to employ some form of deferred rendering, so that you only have to render an object once + X for every shadow it casts (where X is determined by the shadow mapping algorithm). And since shadow maps require different transforms, you wouldn't gain anything from feedback anyway.