OpenGL state redundancy elimination Tree, render state priorities

I am working on an automatic OpenGL batching method in my game engine, to reduce draw calls and redundant state changes.
My batch tree design begins with the most expensive states at the root and adds child nodes further down for each less expensive state.
Example:
Tree Root: Shaders / Programs
Siblings: Blend states ...
and so on.
So my question is what are most likely the most expensive calls, in this list:
binding program
binding textures
binding buffers
buffering texture, vertex data
binding render targets
glEnable / glDisable
blend state equation, color, functions, colorWriteMask
depth stencil state depthFunc, stencilOperations, stencilFunction, writeMasks
Also wondering which method will be faster:
- Collect all batchable draw commands to single vertex buffer and call only 1 draw call (this method would also force to update matrix transforms per vertex on cpu side)
- Do not batch at all and render many small draw calls, only batch particle system ...
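To make the first option concrete, here is a minimal sketch of CPU-side batching. It is not engine code: the function names are hypothetical, and a 2D affine matrix stands in for a full 4x4 transform to keep it short. Every object's vertices are transformed on the CPU and appended to one list, which would then be uploaded to a single vertex buffer and drawn with one call.

```python
# Hypothetical sketch: merge several small meshes into one vertex list by
# applying each object's transform on the CPU (2D affine for brevity).

def transform_point(matrix, point):
    """Apply a 2x3 affine matrix ((a, b, tx), (c, d, ty)) to a 2D point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def build_batch(objects):
    """objects: list of (matrix, vertices). Returns one flat vertex list."""
    batch = []
    for matrix, vertices in objects:
        batch.extend(transform_point(matrix, v) for v in vertices)
    return batch

quad = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
shifted = ((1.0, 0.0, 5.0), (0.0, 1.0, 2.0))

# Two quads end up in one vertex list that a single draw call could consume.
batched_vertices = build_batch([(identity, quad), (shifted, quad)])
```

The trade-off is visible in the loop: the per-vertex transform moves from the vertex shader to the CPU, and the whole batch has to be re-uploaded whenever any object moves.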
PS: Render targets will always be changed before or after the batch (pre or post), depending on usage.
Progress so far:
Andon M. Coleman: Cheapest Uniform & Vertex Array Binding, Expensive FBO, Texture Bindings
datenwolf: Programs invalidate State Cache
1: Framebuffer states
2: Program
3: Texture Binding
...
N: Vertex Array binding, Uniform binding
Current execution Tree in WebGL:
Program
Attribute Pointers
Texture
Blend State
Depth State
Stencil Front / Back State
Rasterizer State
Sampler State
Bind Buffer
Draw Arrays
Each step is a sibling hash tree, to avoid checking against a state cache inside the main render queue.
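A rough sketch of how such a tree could look, using nested dictionaries as the hash levels. The state names, the "_draws" leaf key and both function names are made up for illustration, and no GL calls are issued. Each state node is visited once during traversal, so no comparison against a state cache is needed while drawing.

```python
def insert_draw(tree, state_path, draw):
    """Walk or create one node per state value, then append the draw at the leaf."""
    node = tree
    for state in state_path:
        node = node.setdefault(state, {})
    # "_draws" is a reserved key in this sketch; real state keys must not use it.
    node.setdefault("_draws", []).append(draw)

def flatten(tree, depth=0, out=None):
    """Depth-first walk; each state node is emitted once, before its draws."""
    if out is None:
        out = []
    for key, child in tree.items():
        if key == "_draws":
            out.extend(("draw", d) for d in child)
        else:
            out.append(("set", depth, key))
            flatten(child, depth + 1, out)
    return out

tree = {}
insert_draw(tree, ("progA", "tex1"), "mesh1")
insert_draw(tree, ("progB", "tex1"), "mesh2")
insert_draw(tree, ("progA", "tex1"), "mesh3")
insert_draw(tree, ("progA", "tex2"), "mesh4")

# Submission order was A, B, A, A; the tree groups all progA draws together.
commands = flatten(tree)
```

Here depth 0 plays the role of the program level and depth 1 the texture level; more levels can be added by making the state path longer.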
Loading Textures / Programs / Shaders / Buffers happens before rendering in an extra queue, for future multi threading and also to be sure that the context is initialized before doing anything with it.
The biggest problem of self rendering objects is that you cannot control when something happens, for example if a developer calls these methods before gl is initialized, he wouldn't know why but he would have some bugs or problems...

The relative costs of such operations will of course depend on the usage pattern and your general scenario. But you might find Nvidia's "Beyond Porting" presentation slides a useful guide. Let me reproduce slide 48 in particular here:
Relative Cost of state changes
In decreasing cost...
Render Target ~60K/s
Program ~300K/s
ROP
Texture Bindings ~1.5M/s
Vertex Format
UBO Bindings
Uniform Updates ~10M/s
This does not directly match all of the bullet points of your list. E.g. glEnable/glDisable might affect anything. Also GL's buffer bindings are nothing the GPU directly sees. Buffer bindings are mainly a client side state, depending on the target, of course. Change of blending state would be a ROP state change, and so on.
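As a rough illustration of how these numbers could drive a sort order, the sketch below weights each state change by the inverse of the quoted rate and compares an unsorted draw list with one sorted by priority. Only the four entries that have a rate on the slide are modelled; the dictionary layout and function names are assumptions, and the result is a relative estimate, not a measurement.

```python
# Approximate rates (state changes per second) quoted from the slide above;
# they describe one vendor's hardware and are only used here as weights.
RATES = {"render_target": 60_000, "program": 300_000,
         "texture": 1_500_000, "uniform": 10_000_000}
ORDER = ("render_target", "program", "texture", "uniform")

def state_change_cost(draws):
    """Sum 1/rate for every state field that differs from the previous draw."""
    cost = 0.0
    previous = None
    for draw in draws:
        for field in ORDER:
            if previous is None or draw[field] != previous[field]:
                cost += 1.0 / RATES[field]
        previous = draw
    return cost

def sort_by_priority(draws):
    """Sort so that the most expensive state varies least often."""
    return sorted(draws, key=lambda d: tuple(d[f] for f in ORDER))

# Hypothetical draw list: programs alternate, every draw has its own uniforms.
draws = [
    {"render_target": 0, "program": p, "texture": t, "uniform": i}
    for i, (p, t) in enumerate([(0, 0), (1, 1), (0, 1), (1, 0), (0, 0), (1, 1)])
]

unsorted_cost = state_change_cost(draws)
sorted_cost = state_change_cost(sort_by_priority(draws))
```

In this example the sort reduces the number of program switches from six to two, which dominates the estimate because program changes carry a much larger weight than texture or uniform changes.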

This tends to be highly platform/vendor dependent. Any numbers you may find apply to a specific GPU, platform and driver version. And there are a lot of myths floating around on the internet about this topic. If you really want to know, you need to write some benchmarks, and run them across a range of platforms.
With all these caveats:
Render target (FBO) switching tends to be quite expensive. Highly platform and architecture dependent, though. For example if you have some form of tile based architecture, pending rendering that would ideally be deferred until the end of the frame may have to be completed and flushed out. Or on more "classic" architectures, there might be compressed color buffers or buffers used for early depth testing that need consideration when render targets are switched.
Updating texture or buffer data is impossible to evaluate in general terms. It obviously depends heavily on how much data is being updated. Contrary to some claims on the internet, calls like glBufferSubData() and glTexSubImage2D() do not typically cause a synchronization. But they involve data copies.
Binding programs should not be terribly expensive, but is typically still more heavyweight than the state changes below.
Texture binding is mostly relatively cheap. But it really depends on the circumstances. For example, if you use a GPU that has VRAM, and the texture is not in VRAM at the moment, it might trigger a copy of the texture data from system memory to VRAM.
Uniform updates. This is supposedly very fast on some platforms. But it's actually moderately expensive on others. So there's a lot of variability here.
Vertex state setup (including VBO and VAO binding) is typically fast. It has to be, because it's done so frequently by most apps that it can very quickly become a bottleneck. But there are similar considerations as for textures, where buffer memory may have to be copied/mapped if it was not used recently.
General state updates, like blend states, stencil state, or write masks, are generally very fast. But there can be very substantial exceptions.
Just a typical example of why characteristics can be so different between architectures: If you change blend state, that might be sending a couple of command words on one architecture, with minimal overhead. On other architectures, blending is done as part of the fragment shader. So if you change blend state, the shader program has to be modified to patch in the code for the new blending calculation.

Related

Efficiently update Uniform Buffer Objects with instancing and culling

I've successfully updated my rendering engine to use uniform buffer objects and instancing.
The problem is that, since I do a first frustum culling pass every frame in order to know the objects I need to draw, I have to update the buffers every frame because the objects I draw could change every time, and this isn't the most efficient thing.
How could I make this more efficient?
The only thing I could think of is to skip the frustum culling so that all the buffers remain static and I don't need to update them all the time, but without frustum culling I'd end up drawing a lot of unnecessary objects.
Updating uniform buffers is fairly cheap, to be honest. You are quite limited in size and that prevents you from doing anything too crazy.
What you need to focus on to make this efficient is actually accommodating incomplete commands that are queued up. You are more likely to run into problems where the driver/GPU is forced to stop working on the next frame/command due to poor data write patterns than you are to run into data transfer rate limitations. The problem is always going to be avoiding situations where you might write to portions of data that are still in use by the GPU (it is often working on data 1-2 frames behind the CPU).
You have multiple options depending on your target version, and the OpenGL Wiki has a general overview of buffer streaming approaches.
You will have to do some performance testing to say for sure, but I suspect that CPU-side frustum culling combined with buffer orphaning of your instance UBO will give good results. Rather than reusing any data from previous frames, you would just stream the entire instance UBO from CPU to GPU each frame and let the GPU discard the old UBO when it finishes each frame.
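The idea behind orphaning can be modelled on the CPU without any GL calls. In the toy sketch below (class and method names are hypothetical), each frame gets fresh storage and the previous storage stays alive until the simulated GPU reports that frame as finished, so the CPU never writes into memory that is still being read. In real OpenGL this is commonly done by calling glBufferData with a null data pointer before refilling the buffer, but whether the driver really allocates new storage is up to the implementation.

```python
class StreamedInstanceBuffer:
    """Toy CPU-side model of buffer orphaning; no real GL calls are made."""

    def __init__(self):
        self.storage = None   # storage the CPU fills for the current frame
        self.in_flight = []   # (frame, storage) pairs the "GPU" may still read

    def upload(self, frame, instances):
        # Orphan: keep the old storage alive instead of overwriting it.
        if self.storage is not None:
            self.in_flight.append((frame - 1, self.storage))
        self.storage = list(instances)  # fresh allocation for this frame

    def gpu_finished(self, frame):
        # Release every storage whose frame the GPU has completed.
        self.in_flight = [(f, s) for f, s in self.in_flight if f > frame]

buf = StreamedInstanceBuffer()
buf.upload(0, [1, 2, 3])   # frame 0: three visible instances after culling
buf.upload(1, [2, 3])      # frame 1: one instance was culled
buf.upload(2, [4])         # frame 2: a different instance is visible
buf.gpu_finished(0)        # the GPU is still about two frames behind the CPU
```

After these calls only the storage from frame 1 is still held for the GPU, and the frame 2 data sits in its own storage.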

Reading FBO depth attachment whilst depth testing

I'm working with a deferred rendering engine using OpenGL 3.3. I have an FBO set up as my G-buffer with a texture attached as the depth component.
In my lighting pass I need to depth test (with writes disabled) to cull unnecessary pixels. However, I'm currently writing code which will reconstruct world position coordinates, and this will also need access to the depth buffer.
Is it legal in OpenGL 3.3 to bind a depth attachment to a texture unit and sample it whilst also using it for depth testing in the same pass?
I can't find anything specifically about it in the docs but my gut tells me that using the same buffer/texture for two different purposes will produce undefined behaviour. Does anybody know for sure? I have a limited set of hardware to test on and don't want to make false assumptions about what works.
At the very least this creates a situation where memory coherency cannot be guaranteed (coherency is something you sort of assume at all stages in the traditional pipeline pre-GL4 and have no standardized control over either).
The driver just might cache this memory in an undesirable way since this behavior is undefined. You would like to think that an appropriate combination of writemask and sampling would be a strong hint not to do that, but that is all up to whoever designed the driver and your results will tend to vary by hardware vendor, platform and hardware generation.
This scenario is a use-case for things like NV's texture barrier extension, but that is vendor specific and still does not tackle the problem entirely. If you want to do this sort of thing portably, your best bet is to promote the engine to GL4 and use standardized features for early fragment tests, barriers, etc.
Does your composite pass really need a depth buffer in the first place though? It sounds like you want to re-construct per-pixel position during lighting from the stored depth buffer. That's entirely possible in a framebuffer with no depth attachment at all.
Your G-Buffers will already be filled at this point, and after that you no longer need to do any fragment tests. The one fragment that passed all previous tests is what's eventually written to the G-Buffer and there's no reason to apply any additional tests to it when it comes time to do lighting.
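The reconstruction itself is just the inverse of the projection. The sketch below shows the arithmetic on the CPU, assuming a standard OpenGL perspective matrix and the default depth range of [0, 1]; the function names are made up. It recovers the view-space position, and multiplying by the inverse view matrix would then give world space. In a shader the same math would run per fragment on the sampled depth value.

```python
import math

def projection_terms(fov_y_deg, aspect, near, far):
    """Non-zero terms of a standard OpenGL perspective matrix."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    a = (far + near) / (near - far)
    b = 2.0 * far * near / (near - far)
    return f, aspect, a, b

def project(view_pos, terms):
    """Return (ndc_x, ndc_y, depth), depth in [0, 1] as stored in the depth buffer."""
    f, aspect, a, b = terms
    x, y, z = view_pos
    w = -z  # clip-space w for a camera looking down -Z
    return (f / aspect * x / w, f * y / w, ((a * z + b) / w) * 0.5 + 0.5)

def reconstruct(ndc_x, ndc_y, depth, terms):
    """Invert project(): recover the view-space position from stored depth."""
    f, aspect, a, b = terms
    ndc_z = depth * 2.0 - 1.0
    z = -b / (ndc_z + a)
    return (ndc_x * -z * aspect / f, ndc_y * -z / f, z)

terms = projection_terms(60.0, 16.0 / 9.0, 0.1, 100.0)
original = (0.5, -0.25, -3.0)
ndc_x, ndc_y, depth = project(original, terms)
recovered = reconstruct(ndc_x, ndc_y, depth, terms)
```

Round-tripping a point through project and reconstruct should return the original position up to floating-point error.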

Get results of GPU calculations back to the CPU program in OpenGL

Is there a way to get results from a shader running on a GPU back to the program running on the CPU?
I want to generate a polygon mesh from simple voxel data based on a computational costly algorithm on the GPU but I need the result on the CPU for physics calculations.
Define "the results"?
In general, if you're doing GPGPU-style computations with OpenGL, you are going to need to structure your shaders around the needs of a rendering system. Rendering systems are designed to be one-way: data goes into them and an image is produced. Going backwards, having the rendering system produce data, is not generally how rendering systems are structured.
That doesn't mean you can't do it, of course. But you need to architect everything around the limitations of OpenGL.
OpenGL offers a number of hooks where you can write data from certain shader stages. Most of these require specialized hardware.
Fragment shader outputs
Any hardware capable of fragment shaders will obviously allow you to write to the current framebuffer you're rendering. Through the use of framebuffer objects and textures with floating-point or integer image formats, you can write pretty much any data you want to a variety of images. Once in a texture, you can simply call glGetTexImage to get the rendered pixel data. Or you can just do glReadPixels to get it if the FBO is still bound. Either way works.
The primary limitations of this method are:
The number of images you can attach to the framebuffer; this limits the amount of data you can write. On pre-GL 3.x hardware, FBOs were typically limited to only 4 images plus a depth/stencil buffer. In 3.x and better hardware, you can expect a minimum of 8 images.
The fact that you're rendering. This means that you need to set up your vertex data to position a triangle exactly where you want it to modify data. This is not a trivial undertaking. It's also difficult to get useful input data, since you typically want each texel to be fairly independent of the other. Structuring your fragment shader around these limitations is difficult. Not impossible, but non-trivial in many cases.
Transform Feedback
This OpenGL 3.0 feature allows the output from the Vertex Processing stage of OpenGL (vertex shader and optional geometry shader) to be captured in one or more buffer objects.
This is much more natural for capturing vertex data that you want to play with or render again. In your case, you'll need to read it back after rendering it, perhaps with a glGetBufferSubData call, or by using glMapBufferRange for reading.
The limitations here are that you can generally only capture 4 output values, where each value is a vec4. There are also some strict layout restrictions. Some OpenGL 3.x and 4.x hardware offers the ability to write data to multiple feedback streams, which can all be written into different buffers.
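Once the buffer contents are back on the CPU, they are just raw bytes that have to be split up again according to the layout the shader wrote. A minimal sketch of that step, assuming a hypothetical interleaved layout of one vec4 position followed by one vec4 normal per vertex, stored as tightly packed little-endian 32-bit floats:

```python
import struct

FLOATS_PER_VERTEX = 8  # assumed layout: vec4 position followed by vec4 normal

def decode_feedback(raw_bytes):
    """Split a tightly packed float buffer into (position, normal) tuples."""
    stride = FLOATS_PER_VERTEX * 4  # 4 bytes per 32-bit float
    if len(raw_bytes) % stride != 0:
        raise ValueError("buffer size is not a multiple of the vertex stride")
    vertices = []
    for offset in range(0, len(raw_bytes), stride):
        values = struct.unpack_from("<8f", raw_bytes, offset)
        vertices.append((values[:4], values[4:]))
    return vertices

# Stand-in for the bytes a readback call would return for two vertices.
raw = struct.pack("<16f",
                  1.0, 2.0, 3.0, 1.0,  0.0, 0.0, 1.0, 0.0,
                  4.0, 5.0, 6.0, 1.0,  0.0, 1.0, 0.0, 0.0)
decoded = decode_feedback(raw)
```

The actual layout depends on how the feedback varyings were declared, so the stride and format string would have to match that declaration.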
Image Load/Store
This GL 4.2 feature represents the pinnacle of writing: you can bind an image (a buffer texture, if you want to write to a buffer), and just write to it. There are memory ordering constraints that you need to work within.
It's very flexible, but very complex. Besides the difficulty in using it properly, there are a number of limitations. The number of images you can write to will be fairly limited, perhaps 8 or so. And implementations may have total write limits, so that 8 images to write to may have to be shared by the fragment shader's outputs.
What's more, image outputs are only guaranteed for the fragment shader (and 4.3's compute shaders). That is, hardware is allowed to forbid you from using image load/store on non-FS/CS shader stages.

Is it possible to reuse glsl vertex shader output later?

I have a huge mesh (100k triangles) that needs to be drawn a few times and blended together every frame. Is it possible to reuse the vertex shader output of the first pass over the mesh, and skip the vertex stage on later passes? I am hoping to save some cost on the vertex pipeline and rasterization.
Targeted OpenGL 3.0, can use features like transform feedback.
I'll answer your basic question first, then answer your real question.
Yes, you can store the output of vertex transformation for later use. This is called Transform Feedback. It requires OpenGL 3.x-class hardware or better (aka: DX10-hardware).
The way it works is in two stages. First, you have to set your program up to have feedback-based varyings. You do this with glTransformFeedbackVaryings. This must be done before linking the program, in a similar way to things like glBindAttribLocation.
Once that's done, you need to bind buffers (given how you set up your transform feedback varyings) to GL_TRANSFORM_FEEDBACK_BUFFER with glBindBufferRange, thus setting up which buffers the data are written into. Then you start your feedback operation with glBeginTransformFeedback and proceed as normal. You can use a primitive query object to get the number of primitives written (so that you can draw it later with glDrawArrays), or if you have 4.x-class hardware (or AMD 3.x hardware, all of which supports ARB_transform_feedback2), you can render without querying the number of primitives. That would save time.
Now for your actual question: it's probably not going to buy you any real performance.
You're drawing terrain. And terrain doesn't really get any transformation. Typically you have a matrix multiplication or two, possibly with normals (though if you're rendering for shadow maps, you don't even have that). That's it.
Odds are very good that if you shove 100,000 vertices down the GPU with such a simple shader, you've probably saturated the GPU's ability to render them all. You'll likely bottleneck on primitive assembly/setup, and that's not getting any faster.
So you're probably not going to get much out of this. Feedback is generally used for either generating triangle data for later use (effectively pseudo-compute shaders), or for preserving the results from complex transformations like matrix palette skinning with dual-quaternions and so forth. A simple matrix multiply-and-go will barely be a blip on the radar.
You can try it if you like. But odds are you won't have any problems. Generally, the best solution is to employ some form of deferred rendering, so that you only have to render an object once + X for every shadow it casts (where X is determined by the shadow mapping algorithm). And since shadow maps require different transforms, you wouldn't gain anything from feedback anyway.

Self-Referencing Renderbuffers in OpenGL

I have some OpenGL code that behaves inconsistently across different hardware. I've got some code that:
Creates a render buffer and binds a texture to its color buffer (Texture A)
Sets this render buffer as active, and adjusts the viewport, etc
Activates a pixel shader (gaussian blur, in this instance).
Draws a quad to full screen, with texture A on it.
Unbinds the renderbuffer, etc.
On my development machine this works fine, and has the intended effect of blurring the texture "in place", however on other hardware this does not seem to work.
I've gotten it down to two possibilities.
A) Making a renderbuffer render to itself is not supposed to work, and only works on my development machine due to some sort of fluke.
Or
B) This approach should work, but something else is going wrong.
Any ideas? Honestly I have had a hard time finding specifics about this issue.
A) is the correct answer. Rendering into the same buffer while reading from it is undefined. It might work, it might not - which is exactly what is happening.
In OpenGL's case, the framebuffer_object extension has section "4.4.3 Rendering When an Image of a Bound Texture Object is Also Attached to the Framebuffer" which tells what happens (basically, undefined). In Direct3D9, the debug runtime complains loudly if you use that setup (but it might work depending on hardware/driver). In D3D10 the runtime always unbinds the target that is used as destination, I think.
Why is this undefined? One of the reasons GPUs are so fast is that they can make a lot of assumptions. For example, they can assume that units that fetch pixels do not need to communicate with units that write pixels. So a surface can be read, N cycles later the read is completed, N cycles later the pixel shader ends its execution, then the result is put into some output merge buffers on the GPU, and finally at some point it is written to memory. On top of that, the GPUs rasterize in "undefined" order (one GPU might rasterize in rows, another in some cache-friendly order, another in a different order altogether), so you don't know which portions of the surface will be written to first.
So what you should do is create several buffers. In blur/glow case, two is usually enough - render into first, then read & blur that while writing into second. Repeat this process if needed in ping-pong way.
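The ping-pong scheme can be sketched on the CPU with two plain lists standing in for the two render targets. This is only a model: a 3-tap box blur replaces the gaussian, and the in-place variant uses a fixed left-to-right order, whereas a GPU gives no ordering guarantee at all. It still shows why reading and writing the same buffer gives a different result.

```python
def blur_pass(src, dst):
    """3-tap box blur that reads only from src and writes only to dst."""
    n = len(src)
    for i in range(n):
        left = src[max(i - 1, 0)]
        right = src[min(i + 1, n - 1)]
        dst[i] = (left + src[i] + right) / 3.0

def ping_pong_blur(image, passes):
    """Alternate between two buffers so a pass never reads what it writes."""
    read_buf = list(image)
    write_buf = [0.0] * len(image)
    for _ in range(passes):
        blur_pass(read_buf, write_buf)
        read_buf, write_buf = write_buf, read_buf  # swap roles for the next pass
    return read_buf

def in_place_blur(image):
    """Same pass with src and dst aliased: reads see already-blurred values."""
    buf = list(image)
    blur_pass(buf, buf)
    return buf

image = [0.0, 0.0, 3.0, 0.0, 0.0]
correct = ping_pong_blur(image, 1)
aliased = in_place_blur(image)
```

With separate buffers the single bright texel spreads evenly to its neighbours, while the aliased version smears it in the direction of traversal.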
In some special cases, even the backbuffer might be enough. You simply don't do a glClear, and what you have drawn previously is still there. The caveat is, of course, that you can't really read from the backbuffer. But for effects like fading in and out, this works.