Reading an FBO depth attachment whilst depth testing - OpenGL

I'm working with a deferred rendering engine using OpenGL 3.3. I have an FBO set up as my G-buffer with a texture attached as the depth component.
In my lighting pass I need to depth test (with writes disabled) to cull unnecessary pixels. However, I'm also writing code that reconstructs world-space position, and that will need access to the depth buffer as well.
Is it legal in OpenGL 3.3 to bind a depth attachment to a texture unit and sample it whilst also using it for depth testing in the same pass?
I can't find anything specifically about it in the docs but my gut tells me that using the same buffer/texture for two different purposes will produce undefined behaviour. Does anybody know for sure? I have a limited set of hardware to test on and don't want to make false assumptions about what works.

At the very least this creates a situation where memory coherency cannot be guaranteed. Coherency is something you implicitly assume at every stage of the traditional pre-GL4 pipeline, yet have no standardized control over.
Since the behavior is undefined, the driver might well cache this memory in an undesirable way. You would like to think that an appropriate combination of write mask and sampling would be a strong hint not to do that, but that is all up to whoever designed the driver, and your results will tend to vary by hardware vendor, platform and hardware generation.
This scenario is a use-case for things like NV's texture barrier extension, but that is vendor specific and still does not tackle the problem entirely. If you want to do this sort of thing portably, your best bet is to promote the engine to GL4 and use standardized features for early fragment tests, barriers, etc.
Does your composite pass really need a depth buffer in the first place though? It sounds like you want to reconstruct per-pixel position during lighting from the stored depth buffer. That's entirely possible in a framebuffer with no depth attachment at all.
Your G-Buffers will already be filled at this point, and after that you no longer need to do any fragment tests. The one fragment that passed all previous tests is what's eventually written to the G-Buffer and there's no reason to apply any additional tests to it when it comes time to do lighting.
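In that case the lighting pass is just a full-screen draw with depth testing disabled, sampling the G-buffer's depth texture like any other texture. A minimal sketch under those assumptions (gDepthTex, lightingProgram and the uDepth uniform are illustrative names, not from the question):

```c
/* Lighting pass sketch: no depth attachment, no fragment tests,
 * the G-buffer's depth attachment sampled as a plain texture. */
glBindFramebuffer(GL_FRAMEBUFFER, 0);      /* or an FBO with color attachments only */
glDisable(GL_DEPTH_TEST);
glDepthMask(GL_FALSE);

glUseProgram(lightingProgram);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, gDepthTex);   /* the texture that was the G-buffer depth attachment */
glUniform1i(glGetUniformLocation(lightingProgram, "uDepth"), 0);

/* Draw a full-screen triangle/quad; the shader reconstructs position from uDepth. */
glDrawArrays(GL_TRIANGLES, 0, 3);
```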

Related

Pixelshader: Feedback on whether anything rendered?

Is there any way to get feedback from a pixel shader on whether a pixel actually rendered (as opposed to being blocked by zbuffer or stencil buffer)? I'm using GLSL.
I'm trying to determine if a rendered object is visible at all to the camera. Like if I was doing it in pure software, I'd set a boolean false, and turn it true if any pixel actually passed the z and stencil tests.
Any way to do this, via trickery or otherwise?
A fragment shader cannot ask questions of other fragments (except in very isolated circumstances). Nor can a fragment shader peer into the future and ask questions of fragments that have yet to be generated by the rasterizer. As such, a fragment shader cannot know if some other fragment in the same drawing command passed various tests.
Your application can ask these questions, via an occlusion query. You can get a report on whether any samples generated by the drawing commands in the query scope passed. You can even get (an estimate of) the number of samples which passed.
Of course, getting the information is one thing. Using it in a performance-friendly way is another. After all, GPU commands are executed asynchronously. So the answer to this question may not be known until many milliseconds after you issued the commands. And you'd probably prefer not to have the CPU sitting there while the GPU processes stuff.
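A rough sketch of the non-blocking pattern, assuming a query object created once at init time (query, drawObject and objectWasVisible are illustrative names):

```c
/* Create once at init time. */
GLuint query;
glGenQueries(1, &query);

/* Issue the draw commands inside an occlusion query scope. */
glBeginQuery(GL_SAMPLES_PASSED, query);    /* GL_ANY_SAMPLES_PASSED is also available on GL 3.3+ */
drawObject();
glEndQuery(GL_SAMPLES_PASSED);

/* Later (ideally a frame or more later), poll instead of stalling the CPU. */
GLuint available = 0;
glGetQueryObjectuiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLuint samplesPassed = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samplesPassed);
    int objectWasVisible = (samplesPassed > 0);   /* true if any fragment passed the tests */
}
```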

In OpenGL, (how) can I depth test between two depth buffers?

Say I have two depth buffers ("DA" and "DB"). I'd like to attach DA to a framebuffer and render to it in the normal/"traditional" way: every fragment depth tests against GL_LESS, and if it passes, it is drawn and its depth is written to the depth buffer.
Then, I'd like to attach DB. Here I'd want to reference/use DB in the traditional way, but also depth test against DA with GL_GREATER. If the fragment passes both tests, I'd still just like to write depth to DB.
What this would accomplish is anything drawn in the second pass would only draw if it is behind the contents of the first pass, but in front of any other contents in the second pass.
Does OpenGL 2.1 (or OpenGL ES 2) offer any of this functionality? What could I do to work around it not formally offering it?
If it doesn't offer it, here's an idea for how I'd do it, that I would like to be sanity checked...
You attach a render buffer that acts as a depth buffer, then manually populate it with depths. And manually perform the "if depth greater than texel at DA" test (when failing, discard the fragment). I'm curious what problems I'd run into? Is there an easy way to emulate the depth precision of formally specified depth buffers? Would the performance be awful? Any other concerns?
I don't believe there is a way to have multiple depth buffers. I thought there might at least be vendor specific extensions, but a search through the extension registry came up empty even for that.
What you already started considering in the last paragraph looks like the only reasonably straightforward option that comes to mind. You use a depth texture as your second depth buffer, and discard fragments based on their comparison with the value from the depth texture in your fragment shader.
You don't have to manually populate the depth texture. If the depth values are generated as the result of an earlier rendering pass, you can copy them from the primary depth buffer to a depth texture using glCopyTexImage2D(). This function is available in OpenGL 2.1. In newer versions, you can also use glBlitFramebuffer() to perform the copy.
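A minimal sketch of that copy, assuming a depth texture named depthTexDA (an illustrative name) and that the pass which filled the primary depth buffer has just finished:

```c
/* Snapshot the primary depth buffer into a depth texture (GL 2.1-compatible).
 * This call (re)defines the texture image at the given size. */
glBindTexture(GL_TEXTURE_2D, depthTexDA);
glCopyTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT24, 0, 0, width, height, 0);

/* In the second pass, bind depthTexDA to a texture unit and discard fragments
 * in the shader when their depth fails the comparison against the sampled value. */
```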
On your concerns:
Precision is a very valid concern. I had some fairly severe precision issues when I played with something similar in the past. They can probably be solved, but you definitely need to watch out for it, and carefully think about what your precision requirements are, and how you can fudge the values if necessary.
I would certainly expect performance to suffer somewhat. The fixed function depth buffer has the advantage of heavy optimizations, like early depth testing, that you won't get with your "home made" depth buffer. I wouldn't necessarily expect it to be horrible, though. And if this is what you need, it's worth trying.
I think this is pretty much a no go in ES 2.0, at least without relying on extensions. You could potentially consider writing depth values into color textures, but I think it would get very ugly and inefficient. ES 2.0 is great for its simplicity. But for what you're planning to do, you're exceeding its intended feature set. ES 3.0 adds depth textures, and some other features that could come into play.

OpenGL state redundancy elimination Tree, render state priorities

I am working on an automatic OpenGL batching method in my game engine, to reduce draw calls and redundant calls.
My batch tree design begins with the most expensive states and adds leaves down for each less expensive state.
Example:
Tree Root: Shaders / Programs
Siblings: Blend states ...
and so on.
So my question is: which of the calls in this list are likely to be the most expensive?
binding program
binding textures
binding buffers
buffering texture, vertex data
binding render targets
glEnable / glDisable
blend state equation, color, functions, colorWriteMask
depth stencil state depthFunc, stencilOperations, stencilFunction, writeMasks
I'm also wondering which method will be faster:
- Collect all batchable draw commands into a single vertex buffer and issue only one draw call (this method would also force matrix transforms to be applied per vertex on the CPU side)
- Do not batch at all and render many small draw calls, batching only the particle systems ...
PS: Render targets will always be changed before or after the batch, depending on usage.
Progress so far:
Andon M. Coleman: Cheapest Uniform & Vertex Array Binding, Expensive FBO, Texture Bindings
datenwolf: Programs invalidate State Cache
1: Framebuffer states
2: Program
3: Texture Binding
...
N: Vertex Array binding, Uniform binding
Current execution Tree in WebGL:
Program
Attribute Pointers
Texture
Blend State
Depth State
Stencil Front / Back State
Rasterizer State
Sampler State
Bind Buffer
Draw Arrays
Each step is a sibling hash tree, to avoid checking against the state cache inside the main render queue.
Loading Textures / Programs / Shaders / Buffers happens before rendering in an extra queue, for future multi threading and also to be sure that the context is initialized before doing anything with it.
The biggest problem with self-rendering objects is that you cannot control when something happens; for example, if a developer calls these methods before GL is initialized, they would run into bugs or problems without knowing why...
The relative costs of such operations will of course depend on the usage pattern and your general scenario. But you might find Nvidia's "Beyond Porting" presentation slides a useful guide. Let me reproduce slide 48 in particular here:
Relative Cost of state changes
In decreasing cost...
Render Target ~60K/s
Program ~300K/s
ROP
Texture Bindings ~1.5M/s
Vertex Format
UBO Bindings
Uniform Updates ~10M/s
This does not directly match all of the bullet points of your list. E.g. glEnable/glDisable might affect anything. Also, GL's buffer bindings are not something the GPU directly sees; buffer bindings are mainly client-side state, depending on the target, of course. A change of blending state would be a ROP state change, and so on.
This tends to be highly platform/vendor dependent. Any numbers you may find apply to a specific GPU, platform and driver version. And there are a lot of myths floating around on the internet about this topic. If you really want to know, you need to write some benchmarks, and run them across a range of platforms.
With all these caveats:
Render target (FBO) switching tends to be quite expensive. Highly platform and architecture dependent, though. For example if you have some form of tile based architecture, pending rendering that would ideally be deferred until the end of the frame may have to be completed and flushed out. Or on more "classic" architectures, there might be compressed color buffers or buffers used for early depth testing that need consideration when render targets are switched.
Updating texture or buffer data is impossible to evaluate in general terms. It obviously depends heavily on how much data is being updated. Contrary to some claims on the internet, calls like glBufferSubData() and glTexSubImage2D() do not typically cause a synchronization. But they involve data copies.
Binding programs should not be terribly expensive, but is typically still more heavyweight than the state changes below.
Texture binding is mostly relatively cheap. But it really depends on the circumstances. For example, if you use a GPU that has VRAM, and the texture is not in VRAM at the moment, it might trigger a copy of the texture data from system memory to VRAM.
Uniform updates. This is supposedly very fast on some platforms. But it's actually moderately expensive on others. So there's a lot of variability here.
Vertex state setup (including VBO and VAO binding) is typically fast. It has to be, because it's done so frequently by most apps that it can very quickly become a bottleneck. But there are similar considerations as for textures, where buffer memory may have to be copied/mapped if it was not used recently.
General state updates, like blend states, stencil state, or write masks, are generally very fast. But there can be very substantial exceptions.
Just a typical example of why characteristics can be so different between architectures: If you change blend state, that might be sending a couple of command words on one architecture, with minimal overhead. On other architectures, blending is done as part of the fragment shader. So if you change blend state, the shader program has to be modified to patch in the code for the new blending calculation.
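If it helps to make the ordering concrete, one common approach (not something either answer prescribes) is to fold the expensive states into a sort key, most expensive in the highest bits, and sort the draw list by that key before submission. A rough sketch with made-up field widths:

```c
#include <stdint.h>

/* Hypothetical sort key: the more expensive a state is to change, the higher
 * the bits it occupies. Field widths and choices here are illustrative only. */
typedef struct DrawCall DrawCall;   /* your engine's draw-call record */

typedef struct {
    uint64_t  key;
    DrawCall *draw;
} SortedDraw;

static uint64_t make_key(uint32_t fboId, uint32_t programId,
                         uint32_t textureId, uint32_t blendStateId)
{
    return ((uint64_t)(fboId        & 0xFFu)  << 56) |
           ((uint64_t)(programId    & 0xFFFu) << 44) |
           ((uint64_t)(textureId    & 0xFFFu) << 32) |
           ((uint64_t)(blendStateId & 0xFFu)  << 24);
}

/* qsort() the SortedDraw array by key, then walk it in order and emit a GL
 * call only for the fields that differ from the previous draw's key. */
```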

Get results of GPU calculations back to the CPU program in OpenGL

Is there a way to get results from a shader running on a GPU back to the program running on the CPU?
I want to generate a polygon mesh from simple voxel data based on a computational costly algorithm on the GPU but I need the result on the CPU for physics calculations.
Define "the results"?
In general, if you're doing GPGPU-style computations with OpenGL, you are going to need to structure your shaders around the needs of a rendering system. Rendering systems are designed to be one-way: data goes into them and an image is produced. Going backwards, having the rendering system produce data, is not generally how rendering systems are structured.
That doesn't mean you can't do it, of course. But you need to architect everything around the limitations of OpenGL.
OpenGL offers a number of hooks where you can write data from certain shader stages. Most of these require specialized hardware.
Fragment shader outputs
Any hardware capable of fragment shaders will obviously allow you to write to the current framebuffer you're rendering. Through the use of framebuffer objects and textures with floating-point or integer image formats, you can write pretty much any data you want to a variety of images. Once in a texture, you can simply call glGetTexImage to get the rendered pixel data. Or you can just do glReadPixels to get it if the FBO is still bound. Either way works.
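As a minimal sketch, assuming the FBO is still bound and has a floating-point color attachment (width and height are illustrative names):

```c
#include <stdlib.h>

/* Read back a floating-point color attachment from the currently bound FBO. */
float *results = malloc((size_t)width * height * 4 * sizeof(float));
glReadBuffer(GL_COLOR_ATTACHMENT0);
glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, results);
/* ...use the data on the CPU side, then free(results). */
```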
The primary limitations of this method are:
The number of images you can attach to the framebuffer; this limits the amount of data you can write. On pre-GL 3.x hardware, FBOs were typically limited to only 4 images plus a depth/stencil buffer. In 3.x and better hardware, you can expect a minimum of 8 images.
The fact that you're rendering. This means that you need to set up your vertex data to position a triangle exactly where you want it to modify data. This is not a trivial undertaking. It's also difficult to get useful input data, since you typically want each texel to be fairly independent of the other. Structuring your fragment shader around these limitations is difficult. Not impossible, but non-trivial in many cases.
Transform Feedback
This OpenGL 3.0 feature allows the output from the Vertex Processing stage of OpenGL (vertex shader and optional geometry shader) to be captured in one or more buffer objects.
This is much more natural for capturing vertex data that you want to play with or render again. In your case, you'll need to read it back after rendering it, perhaps with a glGetBufferSubData call, or by using glMapBufferRange for reading.
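For the read-back itself, a small sketch using glMapBufferRange (tfbo and captureBytes are assumed names):

```c
/* Map the transform feedback buffer for reading after the capture pass. */
glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, tfbo);
float *captured = (float *)glMapBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0,
                                            captureBytes, GL_MAP_READ_BIT);
/* ...feed the captured vertex data to the physics code... */
glUnmapBuffer(GL_TRANSFORM_FEEDBACK_BUFFER);
```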
The limitations here are that you can generally capture only 4 output values, where each value is a vec4. There are also some strict layout restrictions. Some OpenGL 3.x and 4.x hardware offers the ability to write data to multiple feedback streams, which can all be written into different buffers.
Image Load/Store
This GL 4.2 feature represents the pinnacle of writing: you can bind an image (a buffer texture, if you want to write to a buffer), and just write to it. There are memory ordering constraints that you need to work within.
It's very flexible, but very complex. Besides the difficulty in using it properly, there are a number of limitations. The number of images you can write to will be fairly limited, perhaps 8 or so. And implementations may have total write limits, so that 8 images to write to may have to be shared by the fragment shader's outputs.
What's more, image outputs are only guaranteed for the fragment shader (and 4.3's compute shaders). That is, hardware is allowed to forbid you from using image load/store on non-FS/CS shader stages.
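On the application side the setup is comparatively small; a sketch, assuming a GL_R32F texture resultTex, a program writingProgram whose fragment shader declares a matching r32f image at binding 0 and uses imageStore(), and a CPU-side buffer cpuBuffer (all illustrative names):

```c
/* GL 4.2+ image load/store: bind a texture level as a writable image. */
glBindImageTexture(0, resultTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_R32F);
glUseProgram(writingProgram);
glDrawArrays(GL_TRIANGLES, 0, 3);          /* or glDispatchCompute() on GL 4.3+ */

/* Make the shader's image writes visible to the read-back below. */
glMemoryBarrier(GL_TEXTURE_UPDATE_BARRIER_BIT);
glBindTexture(GL_TEXTURE_2D, resultTex);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RED, GL_FLOAT, cpuBuffer);
```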

Is it possible to reuse GLSL vertex shader output later?

I have a huge mesh (100k triangles) that needs to be drawn a few times and blended together every frame. Is it possible to reuse the vertex shader output of the first pass of the mesh, and skip the vertex stage on later passes? I am hoping to save some cost on the vertex pipeline and rasterization.
Targeting OpenGL 3.0; I can use features like transform feedback.
I'll answer your basic question first, then answer your real question.
Yes, you can store the output of vertex transformation for later use. This is called Transform Feedback. It requires OpenGL 3.x-class hardware or better (aka: DX10-hardware).
The way it works is in two stages. First, you have to set your program up to have feedback-based varyings. You do this with glTransformFeedbackVaryings. This must be done before linking the program, in a similar way to things like glBindAttribLocation.
Once that's done, you need to bind buffers (given how you set up your transform feedback varyings) to GL_TRANSFORM_FEEDBACK_BUFFER with glBindBufferRange, thus setting up which buffers the data are written into. Then you start your feedback operation with glBeginTransformFeedback and proceed as normal. You can use a primitive query object to get the number of primitives written (so that you can draw it later with glDrawArrays), or if you have 4.x-class hardware (or AMD 3.x hardware, all of which supports ARB_transform_feedback2), you can render without querying the number of primitives. That would save time.
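A condensed sketch of those steps; program, tfbo, tfboSize, tfQuery, vertexCount and the captured varying name "outPosition" are all illustrative assumptions:

```c
/* Before linking: declare which varyings to capture. */
const char *varyings[] = { "outPosition" };
glTransformFeedbackVaryings(program, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);

/* Capture pass: render as normal while recording the processed vertices. */
glBindBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, tfbo, 0, tfboSize);
glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, tfQuery);
glBeginTransformFeedback(GL_TRIANGLES);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
glEndTransformFeedback();
glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);

/* Later passes: bind tfbo as a vertex buffer, set up attribute pointers for
 * the captured layout, then draw the recorded primitive count. */
GLuint primitivesWritten = 0;
glGetQueryObjectuiv(tfQuery, GL_QUERY_RESULT, &primitivesWritten);
glDrawArrays(GL_TRIANGLES, 0, primitivesWritten * 3);
```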
Now for your actual question: it's probably not going to buy you any real performance.
You're drawing terrain. And terrain doesn't really get any transformation. Typically you have a matrix multiplication or two, possibly with normals (though if you're rendering for shadow maps, you don't even have that). That's it.
Odds are very good that if you shove 100,000 vertices down the GPU with such a simple shader, you've probably saturated the GPU's ability to render them all. You'll likely bottleneck on primitive assembly/setup, and that's not getting any faster.
So you're probably not going to get much out of this. Feedback is generally used for either generating triangle data for later use (effectively pseudo-compute shaders), or for preserving the results from complex transformations like matrix palette skinning with dual-quaternions and so forth. A simple matrix multiply-and-go will barely be a blip on the radar.
You can try it if you like. But odds are you won't have any problems. Generally, the best solution is to employ some form of deferred rendering, so that you only have to render an object once + X for every shadow it casts (where X is determined by the shadow mapping algorithm). And since shadow maps require different transforms, you wouldn't gain anything from feedback anyway.