Early stencil culling - OpenGL

I'm trying to get early fragment culling to work, based on the stencil test.
My scenario is the following: I have a fragment shader that does a lot of work, but needs to be run only on very few fragments when I render my scene. These fragments can be located pretty much anywhere on the screen (I can't use a scissor to quickly filter out these fragments).
In rendering pass 1, I generate a stencil buffer with two possible values. Values will have the following meaning for pass 2:
0: do not do anything
1: ok to proceed (e.g., enter the fragment shader and render)
Pass 2 renders the scene properly speaking. The stencil buffer is configured this way:
glStencilMask(1);
glStencilFunc(GL_EQUAL, 1, 1); // if the value is NOT 1, please early cull!
glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP); // never write to stencil buffer
Now I run my app. The color of selected pixels is altered based on the stencil value, which means the stencil test works fine.
However, I should see a huge, spectacular performance boost with early stencil culling... but nothing happens. My guess is that the stencil test either happens after the depth test, or even after the fragment shader has been called. Why?
nVidia apparently has a patent on early stencil culling:
http://www.freepatentsonline.com/7184040.html
Is this the right way to get it enabled?
I'm using an nVidia GeForce GTS 450 graphics card.
Is early stencil culling supposed to work with this card?
Running Windows 7 with latest drivers.

Like early Z, early stencil is often done using hierarchical stencil buffering.
There are a number of factors that can prevent hierarchical tiling from working properly, including rendering into an FBO on older hardware. However, the biggest obstacle to getting early stencil testing working in your example is that you've left stencil writes enabled for 1 of the 8 stencil bits in the second pass.
I would suggest calling glStencilMask(0x00) at the beginning of the second pass to let the GPU know you are not going to write anything to the stencil buffer.
There is an interesting read on early fragment testing as it is implemented in current generation hardware here. That entire blog is well worth reading if you have the time.
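For concreteness, a minimal sketch of the second-pass state this suggests (drawScene() is a hypothetical placeholder; assumes an 8-bit stencil buffer already filled in pass 1):
glEnable(GL_STENCIL_TEST);
glStencilMask(0x00);                     // no stencil writes at all in pass 2
glStencilFunc(GL_EQUAL, 1, 0xFF);        // pass only where pass 1 wrote a 1
glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);  // never modify the stencil buffer
drawScene();                             // hypothetical: render the expensive pass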

Related

Is it possible to depth test against a depth texture I am also sampling, in the same draw call?

Context:
I am using a deferred rendering setup, where in the first stage I have two FBOs: one is the GBuffer, which stores the normals, albedo, and material information for all visible fragments. This FBO has a 32-bit depth texture. It is drawn into during a geometry pass, before any lighting is calculated.
The second FBO is color-only, and starts off black, but accumulates lighting over several passes, from lighting shaders that sample from the GBuffer and write to the color-only buffer using additive blending.
The problem is, I would really like to utilize early depth testing so that my lighting is ONLY calculated for fragments that contain actual geometry (not just sky). The best way I can think of to do this is to use the depth test to fail any pixels that have a depth of one in the case of sunlight, or to fail any pixels that lie behind the sphere of influence of point lights. However, I don't think I can bind this depth texture to my color FBO, since I also sample from it inside the lighting shader to calculate the fragment's position in world space.
So my question is: Is there a way to use the same depth texture for both the early depth test, and for sampling inside the shader? Or if not, is there some other (reasonably performant) way of rejecting pixels that don't have geometry in them? I will not be writing to this depth texture at all in my lighting pass.
I only have to target modern graphics hardware on PCs (so I can use any common extensions, or OpenGL 4.6 features).
There are rules in OpenGL about reading from data in a shader that's also being updated due to a framebuffer operation. Those rules used to be quite strict. Indeed, pre-GL 4.5, the rules were so strict that what you're trying to do was actually undefined behavior. That is, if an image from a texture was attached to the rendering FBO, and you took a sample from that texture in a way such that it was at all possible to be reading from the attached image, you got undefined behavior. Never mind if your write mask meant that no writing happened; it was UB.
Fortunately, it's well-defined now. You only get UB if you're doing an actual write, not merely because you have an image attached to the FBO. And by "now," I mean basically any hardware made in the last 10 years. While ARB_texture_barrier and GL 4.5 are fairly recent, their predecessor NV_texture_barrier is actually quite old. And despite being an NVIDIA extension by name, it was so widely implemented that it is even available on MacOS implementations.
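As a hedged sketch of what this makes legal (lightFBO, gbufferDepthTex, and the draw call are hypothetical names): attach the G-buffer depth texture to the lighting FBO for early depth testing, mask off depth writes, and bind the very same texture as a sampler for position reconstruction.
// Lighting pass: the G-buffer depth texture is both the depth attachment
// (for early depth testing) and a shader input (for position reconstruction).
// With depth writes masked off, this is well-defined behavior.
glBindFramebuffer(GL_FRAMEBUFFER, lightFBO);                  // hypothetical color-only lighting FBO
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                       GL_TEXTURE_2D, gbufferDepthTex, 0);
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_FALSE);                   // read-only depth: no writes, so no feedback loop
glDepthFunc(GL_GREATER);                 // e.g. a fullscreen sun quad drawn just in front of the
                                         // far plane fails wherever the stored depth is 1.0 (sky)
glActiveTexture(GL_TEXTURE3);
glBindTexture(GL_TEXTURE_2D, gbufferDepthTex);  // same texture, sampled by the lighting shader
drawFullscreenSunQuad();                 // hypothetical lighting draw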

Is the stencil buffer still relevant in modern OpenGL?

A friend and I have been having an ongoing argument about the stencil buffer. In short, I haven't been able to find a situation where the stencil buffer would provide any advantage over the programmable pipeline tools in OpenGL 3.2+. Are there any uses for the stencil buffer in modern OpenGL?
[EDIT]
Thanks everyone for all the inputs on the subject.
It is more useful than ever, since you can now sample stencil index textures from fragment shaders; at this point it can hardly be argued that the stencil buffer is not part of the programmable pipeline.
The depth buffer is used for simple pass/fail fragment rejection, which the stencil buffer can also do as suggested in comments. However, the stencil buffer can also accumulate information about test results over multiple passes. All sorts of logic and counting applications exist such as measuring a scene's depth complexity, constructive solid geometry, etc.
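For example, a depth-complexity (overdraw) counter is only a few lines of fixed-function state (a sketch, assuming an 8-bit stencil buffer; drawScene() is a placeholder):
glEnable(GL_STENCIL_TEST);
glStencilMask(0xFF);
glClearStencil(0);
glClear(GL_STENCIL_BUFFER_BIT);
glStencilFunc(GL_ALWAYS, 0, 0xFF);                 // stencil test always passes
glStencilOp(GL_KEEP, GL_INCR_WRAP, GL_INCR_WRAP);  // increment once per fragment, whether depth passes or fails
drawScene();                                       // hypothetical scene draw
// The stencil buffer now holds a per-pixel overdraw count that can be visualized or read back.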
To add a recent example to Andon's answer, GTA V uses the stencil buffer kinda like an ID buffer to mark the player character, cars, vegetation etc.
It subsequently uses the stencil buffer to e.g. apply subsurface scattering only to the character or exclude him from motion blur.
See the GTA V Graphics Study (highly recommended, it's a great read!)
Edit: sure, you can do this in software. But you can do rasterization or tessellation in software just as well... In the end it's about performance, I guess. With depth24stencil8 you have a nice hardware-supported format, and the stencil test is most likely faster than doing discards in the fragment shader.
Just to provide one other use case, shadow volumes (aka "stencil shadows") are still very relevant: https://en.wikipedia.org/wiki/Shadow_volume
They're useful for indoor scenes where shadows are supposed to be pixel perfect, and you're less likely to have alpha-tested foliage messing up the extruded shadow volumes.
It's true that shadow maps are more common, but I suspect that stencil shadows will have a comeback once the brain-dead Creative/3DLabs patent on the z-fail method expires.
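For reference, a rough sketch of the classic z-fail stencil state (assumes the shadow volume geometry is drawn in a dedicated pass; drawShadowVolumes() is a placeholder):
glEnable(GL_STENCIL_TEST);
glDepthMask(GL_FALSE);                                          // volumes never write depth
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);            // or color
glDisable(GL_CULL_FACE);                                        // both faces in one two-sided pass
glStencilFunc(GL_ALWAYS, 0, 0xFF);
glStencilOpSeparate(GL_BACK,  GL_KEEP, GL_INCR_WRAP, GL_KEEP);  // back faces: increment on depth fail
glStencilOpSeparate(GL_FRONT, GL_KEEP, GL_DECR_WRAP, GL_KEEP);  // front faces: decrement on depth fail
drawShadowVolumes();                                            // hypothetical volume draw
// A later lighting pass then shades only where the stencil value is 0 (unshadowed).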

OpenGL GPU optimization when rendering stacks of primitives

If I render 15 fully opaque quads of the same size on top of each other with the depth test disabled, is the GPU hardware/software clever enough to process just the topmost quad and discard the other vertices/fragments? Or would one benefit from using the stencil buffer to achieve the same effect?
Most GPUs will overdraw in this scenario which will be very bad for performance if your quads are large. Rather than use the stencil buffer, the best way to optimise is probably to enable depth testing, assign appropriate depth values and render your quads front to back.
However, under certain conditions (e.g. no blending) tile based deferred rendering (TBDR) GPUs common in many mobile devices (particularly PowerVR devices used by all iOS devices and many Android devices) will do a process known as hidden surface removal (HSR) which will optimize this case and avoid rendering the pixels that will be obscured.
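A minimal sketch of the front-to-back, depth-tested approach suggested above (drawQuadAtDepth() is a hypothetical helper that draws one quad at the given depth):
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
glClear(GL_DEPTH_BUFFER_BIT);
for (int i = 0; i < 15; ++i)
    drawQuadAtDepth(i / 15.0f);   // hypothetical: topmost quad first, each later quad slightly farther away
// Every quad after the first fails the early depth test, so its fragments are never shaded.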
It will definitely generate fragments for all of the opaque quads. Also, if you disable the depth test you may even see a back surface on the screen, because with depth testing disabled whichever quad is rendered last ends up on screen.
Even if you use the stencil buffer, fragments are still generated for all the quads and still pass through the stencil and depth tests.

how to do customized stencil test in fragment shader

In my project I used the 'discard' call to perform a customized stencil test, which tries to draw things only in a specified area defined by a stencil texture. Here is the code from the fragment shader:
// get the stencil value from a texture
float value = texture2D(stencilTexture, gl_FragCoord.xy / 1024.0).x;
// check if value equals the desired value; if not, draw nothing
if (abs(value - desiredValue) > 0.1)
{
    discard;
}
This code works, but suffers from a performance problem because of the 'discard' call. Is there an alternative way to do this through GPU Shaders? Tell me how.
If you access a texture, you must suffer the performance penalties associated with accessing a texture. In the same way, if you want to stop a fragment from being rendered, you must suffer the performance penalties associated with stopping fragments from being rendered.
This will be true regardless of how you stop that fragment. Whether it's a true stencil test, your shader-based discard, or alpha testing, all of these will encounter the same general performance issues (for hardware where discard leads to any significant performance problems, which is mainly mobile hardware). The only exception is the depth test, and that's because of why certain hardware has problems with discard.
For platforms where discard has a substantial impact in performance, the rendering algorithm works most optimally if the hardware can assume that the depth is the final arbiter of whether a fragment will be rendered (and thus, the fragment with the highest/lowest depth always wins). Therefore, any method of culling the fragment other than the depth test will interfere with this optimization.
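If the depth test really is the only cheap rejection path on your target hardware, one hedged variant (assuming the masked pass does not otherwise need the depth buffer; shader and draw-call names are hypothetical) is to convert the mask texture into depth values in a trivial pre-pass and let early depth testing reject the unwanted fragments:
// Pre-pass: write a "near" depth everywhere the mask says "do not draw", using a
// trivial fullscreen quad placed at the near plane whose shader discards where
// drawing IS allowed.
glClear(GL_DEPTH_BUFFER_BIT);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_ALWAYS);
glDepthMask(GL_TRUE);
drawFullscreenQuad(maskToDepthShader);   // hypothetical: discards where the mask equals desiredValue

// Main pass: the expensive shader runs with an ordinary depth test; fragments in
// the masked-out regions fail early-Z before the shader ever executes.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);                   // assumes the masked pass does not need depth itself
drawMaskedGeometry();                    // hypothetical draw of the expensive pass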

GLSL Interlacing

I would like to efficiently render in an interlaced mode using GLSL.
I can already do this like so:
vec4 background = texture2D(plane[5], gl_TexCoord[1].st);
if (is_even_row(gl_TexCoord[1].t))
{
    vec4 foreground = get_my_color();
    gl_FragColor = vec4(foreground.rgb * foreground.a + background.rgb * (1.0 - foreground.a),
                        background.a + foreground.a);
}
else
    gl_FragColor = background;
However, as far as I have understood the nature of branching in GLSL, both branches will actually be executed, since is_even_row() is evaluated at run time.
Is there any trick I can use here to avoid unnecessarily calling the rather heavy function get_my_color()? The behavior of is_even_row() is quite static.
Or is there some other way to do this?
NOTE: glPolygonStipple will not work since I have custom blend functions in my GLSL code.
(comment to answer, as requested)
The problem with interlacing is that GPUs run shaders in 2x2 clusters, which means that you gain nothing from interlacing (a good software implementation might possibly only execute the actual pixels that are needed, unless you ask for partial derivatives).
At best, interlacing runs at the same speed, at worst it runs slower because of the extra work for the interlacing. Some years ago, there was an article in ShaderX4, which suggested interlaced rendering. I tried that method on half a dozen graphics cards (3 generations of hardware of each the "two big" manufacturers), and it ran slower (sometimes slightly, sometimes up to 50%) in every case.
What you could do is do all the expensive rendering at 1/2 the vertical resolution; this will reduce the pixel shader work (and texture bandwidth) by half. You can then upscale the texture (GL_NEAREST) and discard every other line.
The stencil test can be used to discard pixels before the pixel shader is executed. Of course the hardware still runs shaders in 2x2 groups, so in this pass you do not gain anything. However, that does not matter if it's just the very last pass, which is a trivial shader writing out a single fetched texel. The more costly composition shaders (the ones that matter!) run at half resolution.
You can find a detailed description including code here: fake dynamic branching. This demo avoids lighting pixels by using the stencil to discard those that are outside the light's range.
Another way which does not need the stencil buffer is to use "explicit Z culling". This may in fact be even easier and faster.
For this, clear Z, disable color writes (glColorMask), and draw a fullscreen quad whose vertices have some "close" Z coordinate, and have the shader kill fragments in every odd line (or use the deprecated alpha test if you want, or whatever). gl_FragCoord.y is a very simple way of knowing which line to kill, using a small texture that wraps around would be another (if you must use GLSL 1.0).
Now draw another fullscreen quad with "far away" Z values in the vertices (and with depth test, of course). Simply fetch your half-res texture (GL_NEAREST filtering), and write it out. Since the depth buffer has a value that is "closer" in every other row, it will discard those pixels.
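A sketch of that explicit Z culling setup (the quad-drawing helpers and shaders named here are hypothetical):
// Step 1: seed the depth buffer with a "near" value on the lines to be masked out.
glClearDepth(1.0);
glClear(GL_DEPTH_BUFFER_BIT);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_ALWAYS);
glDepthMask(GL_TRUE);
drawFullscreenQuadAtDepth(0.0f, killOddLinesShader);   // hypothetical: discards fragments on odd lines
                                                       // (via gl_FragCoord.y), so near depth lands
                                                       // only on even lines

// Step 2: composite the half-resolution texture with a "far" quad; it survives the
// depth test only on the lines that still hold the cleared (far) depth.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);
drawFullscreenQuadAtDepth(0.9f, fetchHalfResShader);   // hypothetical: GL_NEAREST fetch of the half-res texture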
How does glPolygonStipple compare to this? Polygon stipple is a deprecated feature, because it is not directly supported by the hardware and has to be emulated by the driver either by "secretly" rewriting the shader to include extra logic or by falling back to software.
This is probably not the right way to do interlacing. If you really need to achieve this effect, don't do it in the fragment shader like this. Instead, here is what you could do:
Initialize a full screen 1-bit stencil buffer, where each bit stores the parity of its corresponding row.
Render your scene like usual to a temporary FBO with 1/2 the vertical resolution.
Turn on the stencil test, and switch the stencil func depending on which set of scan lines you are going to draw.
Draw a rescaled version of the aforementioned FBO (containing the contents of your frame) to the screen; the stencil test masks it to the correct set of scan lines.
Note that you could skip the offscreen FBO step and draw directly using the stencil buffer, but this would waste some fill rate testing those pixels that are just going to be clipped anyway. If your program is shader heavy, the solution I just mentioned would be optimal. If it is not, you may end up being marginally better off drawing directly to the screen.
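A rough sketch of the stencil-based variant described above (shader and helper names are hypothetical):
// One-time setup: mark every other scan line in an 8-bit stencil buffer, e.g. with a
// fullscreen quad whose shader discards odd lines (gl_FragCoord.y).
glEnable(GL_STENCIL_TEST);
glStencilMask(0xFF);
glClearStencil(0);
glClear(GL_STENCIL_BUFFER_BIT);
glStencilFunc(GL_ALWAYS, 1, 0xFF);
glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
drawFullscreenQuad(markEvenLinesShader);                     // hypothetical: even lines end up with stencil == 1

// Each frame: draw the rescaled half-resolution FBO contents; the stencil test selects
// which field (even or odd lines) actually gets written.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glStencilMask(0x00);                                         // stencil is read-only from here on
glStencilFunc(evenField ? GL_EQUAL : GL_NOTEQUAL, 1, 0xFF);  // evenField: hypothetical per-frame flag
drawFullscreenQuad(fetchHalfResShader);                      // hypothetical: samples the half-res texture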