How to do a customized stencil test in a fragment shader (OpenGL)

In my project, I use a 'discard' call to perform a customized stencil test, which draws only within an area defined by a stencil texture. Here is the code from the fragment shader:
//get the stencil value from a texture
float value=texture2D( stencilTexture, gl_FragCoord.xy/1024.0).x;
//check if value equals the desired value, if not draw nothing
This code works, but suffers from a performance problem because of the 'discard' call. Is there an alternative way to do this through GPU Shaders? Tell me how.

If you access a texture, you must suffer the performance penalties associated with accessing a texture. In the same way, if you want to stop a fragment from being rendered, you must suffer the performance penalties associated with stopping fragments from being rendered.
This will be true regardless of how you stop that fragment. Whether it's a true stencil test, your shader-based discard, or alpha testing, all of these will encounter the same general performance issues (for hardware where discard leads to any significant performance problems, which is mainly mobile hardware). The only exception is the depth test, and that's because of why certain hardware has problems with discard.
For platforms where discard has a substantial impact in performance, the rendering algorithm works most optimally if the hardware can assume that the depth is the final arbiter of whether a fragment will be rendered (and thus, the fragment with the highest/lowest depth always wins). Therefore, any method of culling the fragment other than the depth test will interfere with this optimization.
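For reference, the shader-based test described in the question could be written out roughly as follows. This is only a sketch: `desiredValue` and `tolerance` are hypothetical uniforms that do not appear in the original code, and the fixed 1024.0 divisor is taken from the question as-is.

```glsl
#version 120

uniform sampler2D stencilTexture;  // stencil mask stored as a texture (from the question)
uniform float desiredValue;        // hypothetical: the mask value that allows drawing
uniform float tolerance;           // hypothetical: slack for the float comparison

void main()
{
    // Get the stencil value from the texture (1024.0 is the size used in the question).
    float value = texture2D(stencilTexture, gl_FragCoord.xy / 1024.0).x;

    // If the value does not match the desired value, draw nothing.
    if (abs(value - desiredValue) > tolerance)
        discard;

    // Placeholder output; the real shading would go here.
    gl_FragColor = vec4(1.0);
}
```

Whether the fragment is culled this way, by a real stencil test, or by alpha testing, the cost discussed above is of the same general kind.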


Optimize for expensive fragment shader

I'm rendering multiple layers of flat triangles with a raytracer in the fragment shader. The upper layers have holes, and I'm looking for a way to avoid running the shader for pixels that have already been filled by one of the upper layers, i.e. I want only those parts of the lower layers rendered that lie in the holes of the upper layers. Of course, whether there is a hole or not is not known until the fragment shader has run for that layer.
As far as I understand, I cannot use early depth testing because there, the depth values are interpolated between the vertices and do not come from the fragment shader. Is there a way to "emulate" that behavior?
The best way to solve this issue is to not use layers. You are only using layers because of the limitations of using a 3D texture to store your scene data. So... don't do that.
SSBOs and buffer textures (if your hardware is too old for SSBOs) can access more memory than a 3D texture. And you could even employ manual swizzling of the data to improve cache locality if that is viable.
As far as I understand, I cannot use early depth testing because there, the depth values are interpolated between the vertices and do not come from the fragment shader.
This is correct insofar as you cannot use early depth tests, but it is incorrect as to why.
The "depth" provided by the VS doesn't need to be the depth of the actual fragment. You are rendering your scene in layers, presumably with each layer being a full-screen quad. By definition, everything in one layer of rendering is beneath everything in a higher layer. So the absolute depth value doesn't matter; what matters is whether there is something from a higher layer over this fragment.
So each layer could get its own depth value, with lower layers getting a lower depth value. The exact value is arbitrary and irrelevant; what matters is that higher layers have higher values.
The reason this doesn't work is this: if your raytracing algorithm detects a miss within a layer (a "hole"), you must discard that fragment. And the use of discard at all turns off early depth testing in most hardware, since the depth testing logic is usually tied to the depth writing logic (it is an atomic read/conditional-modify/conditional-write).
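The per-layer depth idea can be sketched as a vertex shader like the one below. It is an illustration only: `layerDepth` is a hypothetical uniform, and the sketch assumes the depth function is configured so that higher values win (for example `GL_GEQUAL` with the depth buffer cleared to 0), matching the convention used above.

```glsl
#version 330 core

// Corners of the full-screen quad, already in normalized device coordinates.
layout(location = 0) in vec2 position;

// Hypothetical uniform: one constant depth per layer, in [-1, 1].
// Higher layers get higher values; the exact numbers are arbitrary.
uniform float layerDepth;

void main()
{
    // Every fragment of this layer receives the same depth, independent
    // of what the raytracer in the fragment shader later computes.
    gl_Position = vec4(position, layerDepth, 1.0);
}
```

As explained above, this alone does not help, because the fragment shader still has to `discard` on a miss, and that disables early depth testing on most hardware.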

Reading FBO depth attachment whilst depth testing

I'm working with a deferred rendering engine using OpenGL 3.3. I have an FBO set up as my G-buffer with a texture attached as the depth component.
In my lighting pass I need to depth test (with writes disabled) to cull unnecessary pixels. However, I'm currently writing code that will reconstruct world position coordinates, and this will also need access to the depth buffer.
Is it legal in OpenGL 3.3 to bind a depth attachment to a texture unit and sample it whilst also using it for depth testing in the same pass?
I can't find anything specifically about it in the docs but my gut tells me that using the same buffer/texture for two different purposes will produce undefined behaviour. Does anybody know for sure? I have a limited set of hardware to test on and don't want to make false assumptions about what works.
At the very least this creates a situation where memory coherency cannot be guaranteed (coherency is something you sort of assume at all stages in the traditional pipeline pre-GL4 and have no standardized control over either).
The driver just might cache this memory in an undesirable way since this behavior is undefined. You would like to think that an appropriate combination of writemask and sampling would be a strong hint not to do that, but that is all up to whoever designed the driver and your results will tend to vary by hardware vendor, platform and hardware generation.
This scenario is a use-case for things like NV's texture barrier extension, but that is vendor specific and still does not tackle the problem entirely. If you want to do this sort of thing portably, your best bet is to promote the engine to GL4 and use standardized features for early fragment tests, barriers, etc.
Does your composite pass really need a depth buffer in the first place though? It sounds like you want to re-construct per-pixel position during lighting from the stored depth buffer. That's entirely possible in a framebuffer with no depth attachment at all.
Your G-Buffers will already be filled at this point, and after that you no longer need to do any fragment tests. The one fragment that passed all previous tests is what's eventually written to the G-Buffer and there's no reason to apply any additional tests to it when it comes time to do lighting.
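Reconstructing position from the stored depth might look roughly like this in the lighting pass. It is a sketch under assumptions: `depthTexture`, `invViewProj` and `screenSize` are hypothetical uniform names, and the default depth range of [0, 1] is assumed.

```glsl
#version 330 core

uniform sampler2D depthTexture;  // hypothetical: G-buffer depth, bound as a texture only
uniform mat4 invViewProj;        // hypothetical: inverse of (projection * view)
uniform vec2 screenSize;         // hypothetical: viewport size in pixels

out vec4 fragColor;

void main()
{
    // Window-space position of this pixel, remapped to [0, 1].
    vec2 uv = gl_FragCoord.xy / screenSize;
    float depth = texture(depthTexture, uv).r;

    // Back to normalized device coordinates in [-1, 1] (assumes glDepthRange(0, 1)).
    vec4 ndc = vec4(uv * 2.0 - 1.0, depth * 2.0 - 1.0, 1.0);

    // Undo the projection and the perspective divide.
    vec4 world = invViewProj * ndc;
    vec3 worldPos = world.xyz / world.w;

    // Placeholder: a real lighting pass would use worldPos for shading.
    fragColor = vec4(worldPos, 1.0);
}
```

Because the depth texture is only sampled here and the target framebuffer has no depth attachment, the read/test conflict from the question does not arise.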

What happens to the depth buffer if I discard a fragment in a shader using early_fragment_tests?

I'm using a fragment shader which discards some fragments using the discard keyword. My shader also uses early_fragment_tests (image load/store requires it).
I do not write the gl_FragDepth, I let the standard OpenGL handle the depth value.
Will my depth buffer be updated with the fragment's depth before the discard keyword is executed?
It does not seem like it on my NVidia Quadro 600 and K5000.
Any clue where I could find that information? FYI, I searched and found some closely related topics, but not this particular one.
Will my depth buffer be updated with the fragment's depth before the discard keyword is executed?
No, this sort of behavior is explicitly forbidden in a shader that contains discard or that writes an arbitrary value to gl_FragDepth. This is because in such a shader, the depth of your fragment after it is shaded may be unrelated to the position generated during initial rasterization (pre-shading).
Without writing to gl_FragDepth or discarding, the depth of a fragment is actually known long before the actual fragment shader executes and this forms the foundation for early depth tests. Rasterization/shading can be skipped for some (individual tiled regions) or all of a primitive if it can be determined that it would have failed a depth test before the fragment shader is evaluated, but if the fragment shader itself is what determines a fragment's depth, then all bets are off.
There is an exception to this rule in DX11 / OpenGL 4.x. If you write your shaders in such a way that you can guarantee the output depth will always preserve the result of a depth test (same result as the depth generated during rasterization), early fragment tests can be enabled in a shader that uses discard or writes to gl_FragDepth. This feature is known as conservative depth, and unless you use this it is generally understood that discard is going to break early depth optimizations across the board.
Now, since you should never write to the depth buffer before you know whether the value you are writing passes or fails a depth test (gl_FragDepth may be different) or if the fragment even survives (discard may be used), you can see why a primitive shaded by a fragment shader that contains discard cannot write to the depth buffer before the shader is evaluated.
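The conservative depth feature mentioned above is expressed in GLSL 4.20 (or through ARB_conservative_depth) by redeclaring gl_FragDepth with a layout qualifier. A minimal sketch, in which `depthBias` is a hypothetical uniform used purely for illustration:

```glsl
#version 420 core

// Promise to the implementation: the depth written here is never smaller
// than the depth produced by rasterization (gl_FragCoord.z).
layout(depth_greater) out float gl_FragDepth;

uniform float depthBias;  // hypothetical: non-negative offset

out vec4 fragColor;

void main()
{
    // Only ever push the fragment further away, which keeps the promise above.
    gl_FragDepth = gl_FragCoord.z + max(depthBias, 0.0);
    fragColor = vec4(1.0);
}
```

The other qualifiers are `depth_less`, `depth_unchanged` and `depth_any`; which one is appropriate depends on the depth function in use.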
I think the information you are looking for is on that page:
If early fragment tests are enabled, any depth value computed by the
fragment shader has no effect. Additionally, the depth buffer, stencil
buffer, and occlusion query sample counts may be updated even for
fragments or samples that would be discarded after fragment shader
execution due to per-fragment operations such as alpha-to-coverage or
alpha tests.
The word "may" in "the depth buffer, [etc.] may be updated", implies it is implementation dependent (or completely random).
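For context, the kind of shader the question describes could look roughly like this. It is a sketch only: `counterImage` and `coverage` are hypothetical names, and the image format is an assumption.

```glsl
#version 420 core

// Force depth/stencil tests to run before the shader executes.
layout(early_fragment_tests) in;

// Hypothetical image used for load/store side effects.
layout(r32ui, binding = 0) uniform uimage2D counterImage;

in float coverage;   // hypothetical varying supplied by the vertex shader
out vec4 fragColor;

void main()
{
    // Side effect that should only happen for fragments passing the depth test.
    imageAtomicAdd(counterImage, ivec2(gl_FragCoord.xy), 1u);

    // By the time this runs, the early tests are already done; per the
    // passage quoted above, the depth buffer may have been updated anyway.
    if (coverage < 0.5)
        discard;

    fragColor = vec4(1.0);
}
```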

GLSL Interlacing

I would like to efficiently render in an interlaced mode using GLSL.
I can already do this like:
vec4 background = texture2D(plane[5], gl_TexCoord[1].st);
if (is_even_row) {
    vec4 foreground = get_my_color();
    gl_FragColor = vec4(foreground.rgb * foreground.a + background.rgb * (1.0 - foreground.a), background.a + foreground.a);
} else {
    gl_FragColor = background;
}
However, as far as I understand the nature of branching in GLSL, both branches will actually be executed, since "is_even_row" is treated as a run-time value.
Is there any trick I can use here to avoid unnecessarily calling the rather heavy function "get_my_color"? The behavior of is_even_row is quite static.
Or is there some other way to do this?
NOTE: glPolygonStipple will not work since I have custom blend functions in my GLSL code.
(comment to answer, as requested)
The problem with interlacing is that GPUs run shaders in 2x2 clusters, which means that you gain nothing from interlacing (a good software implementation might possibly only execute the actual pixels that are needed, unless you ask for partial derivatives).
At best, interlacing runs at the same speed; at worst, it runs slower because of the extra work for the interlacing. Some years ago, there was an article in ShaderX4 which suggested interlaced rendering. I tried that method on half a dozen graphics cards (3 generations of hardware from each of the "two big" manufacturers), and it ran slower (sometimes slightly, sometimes up to 50%) in every case.
What you could do is do all the expensive rendering in 1/2 the vertical resolution, this will reduce the pixel shader work (and texture bandwidth) by 1/2. You can then upscale the texture (GL_NEAREST), and discard every other line.
The stencil test can be used to discard pixels before the pixel shader is executed. Of course the hardware still runs shaders in 2x2 groups, so in this pass you do not gain anything. However, that does not matter if it's just the very last pass, which is a trivial shader writing out a single fetched texel. The more costly composition shaders (the ones that matter!) run at half resolution.
You can find a detailed description including code here: fake dynamic branching. This demo avoids lighting pixels by discarding those that are outside the light's range using the stencil.
Another way which does not need the stencil buffer is to use "explicit Z culling". This may in fact be even easier and faster.
For this, clear Z, disable color writes (glColorMask), and draw a fullscreen quad whose vertices have some "close" Z coordinate, and have the shader kill fragments in every odd line (or use the deprecated alpha test if you want, or whatever). gl_FragCoord.y is a very simple way of knowing which line to kill, using a small texture that wraps around would be another (if you must use GLSL 1.0).
Now draw another fullscreen quad with "far away" Z values in the vertices (and with depth test, of course). Simply fetch your half-res texture (GL_NEAREST filtering), and write it out. Since the depth buffer has a value that is "closer" in every other row, it will discard those pixels.
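The first pass of the explicit Z culling approach could be sketched with a fragment shader like the one below. The host-side state is an assumption stated in the comments; only the row test itself comes from the description above.

```glsl
#version 120

// Assumed host-side state for this pass (not shown here):
//   - depth buffer cleared, depth test and depth writes enabled
//   - color writes disabled via glColorMask
//   - a fullscreen quad drawn with a "close" Z in its vertices

void main()
{
    // gl_FragCoord.y sits at pixel centers (0.5, 1.5, ...), so floor()
    // yields the integer row index.
    float row = floor(gl_FragCoord.y);

    // Kill every odd row; only even rows receive the "close" depth value.
    if (mod(row, 2.0) >= 1.0)
        discard;

    // Color is masked off, so the value written here is irrelevant.
    gl_FragColor = vec4(0.0);
}
```

In the second pass, the "far away" quad then fails the depth test on the even rows and is drawn only on the odd ones. Swapping the comparison selects the other field.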
How does glPolygonStipple compare to this? Polygon stipple is a deprecated feature, because it is not directly supported by the hardware and has to be emulated by the driver either by "secretly" rewriting the shader to include extra logic or by falling back to software.
This is probably not the right way to do interlacing. If you really need to achieve this effect, don't do it in the fragment shader like this. Instead, here is what you could do:
1. Initialize a full-screen 1-bit stencil buffer, where each bit stores the parity of its corresponding row.
2. Render your scene as usual to a temporary FBO with 1/2 the vertical resolution.
3. Turn on the stencil test, and switch the stencil func depending on which set of scan lines you are going to draw.
4. Draw a rescaled version of the aforementioned FBO (containing the contents of your frame) to the screen through the stencil test.
Note that you could skip the offscreen FBO step and draw directly using the stencil buffer, but this would waste some fill rate testing those pixels that are just going to be clipped anyway. If your program is shader heavy, the solution I just mentioned would be optimal. If it is not, you may end up being marginally better off drawing directly to the screen.
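The final pass that writes the half-resolution frame back out can be a very small shader. A sketch, assuming hypothetical uniforms `halfResFrame` (the temporary FBO's color texture, set to GL_NEAREST filtering) and `screenSize`:

```glsl
#version 120

uniform sampler2D halfResFrame;  // hypothetical: color texture of the half-height FBO
uniform vec2 screenSize;         // hypothetical: full-resolution viewport size in pixels

void main()
{
    // Normalized coordinates stretch the half-height texture over the full
    // screen; with GL_NEAREST, each source row covers two screen rows.
    vec2 uv = gl_FragCoord.xy / screenSize;

    // The stencil test (configured on the host side) decides which rows survive.
    gl_FragColor = texture2D(halfResFrame, uv);
}
```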

Quick question about glColorMask and its work

I want to render depth buffer to do some nice shadow mapping. My drawing code though, consists of many shader switches. If I set glColorMask(0,0,0,0) and leave all shader programs, textures and others as they are, and just render the depth buffer, will it be 'OK' ? I mean, if glColorMask disables the "write of color components", does it mean that per-fragment shading IS NOT going to be performed?
For rendering a shadow map, you will normally want to bind a depth texture (preferably square and power of two, because stereo drivers take this as a hint!) to an FBO and use exactly one shader (as simple as possible) for everything. You do not want to attach a color buffer, because you are not interested in color at all, and it puts more unnecessary pressure on ROP (plus, some hardware can render at double speed or more with depth-only). You do not want to switch between many shaders.
Depending on whether you do "classic" shadow mapping, or something more sophisticated such as exponential shadow maps, the shader that you will use is either as simple as it can be (constant color, and no depth write), or performs some (moderately complex) calculations on depth, but you normally do not want to perform any colour calculations, since that will mean needless calculations which will not be visible in any way.
No, the fragment operations will be performed anyway, but their result will be squashed by your zero color mask.
If you don't want some fragment operations to be performed - use the proper shader program which has an empty fragment shader attached and set the draw buffer to GL_NONE.
There is another way to disable fragment processing - to enable GL_RASTERIZER_DISCARD, but you won't get even the depth values in this case :)
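The "empty fragment shader" suggested above can be as small as this. It is a sketch: the host-side calls named in the comments are the standard ones for a depth-only pass, but how they fit into the asker's renderer is an assumption.

```glsl
#version 330 core

// Assumed host-side state for the shadow pass (not shown here):
//   - FBO with only a depth texture attached
//   - glDrawBuffer(GL_NONE) so that no color output is expected

void main()
{
    // Intentionally empty: depth comes from rasterization, and no color is written.
}
```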
No, the shader programs execute independently of the fixed-function pipeline. Setting glColorMask will have no effect on the shader programs.