Under what conditions does a multi-pass approach become strictly necessary? - opengl

I'd like to enumerate those general, fundamental circumstances under which multi-pass rendering becomes an unavoidable necessity, as opposed to keeping everything within the same shader program. Here's what I've come up with so far.
When a result requires non-local fragment information (i.e. context around the current fragment), e.g. for box filters; a previous pass must have supplied this;
When a result needs hardware (texture) interpolation of values produced by a prior pass;
When a result acts as a pre-cache of some set of calculations that enables substantially better performance than simply (re-)working through the entire set of calculations in each pass that uses them, e.g. transforming each fragment of the depth buffer in a particular and costly way, which multiple later-pass shaders can then share, rather than each repeating those calculations. So: calculate once, use more than once (a rough sketch of this pattern follows the list).
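To make points 1 and 3 concrete, here is a rough, hypothetical sketch of the pattern: a first pass writes an expensive per-fragment value into a texture once, and a later pass samples it, including the neighbourhood reads a box filter needs. All the names are mine, and a GL 3.3+ context plus a loader such as glad are assumed; this is just one way it could look.

```cpp
#include <glad/glad.h>

// Pass 1: render the costly per-fragment value (e.g. linearized depth) into
// a single-channel float texture attached to an FBO. Done once per frame.
GLuint makeCachedValueTexture(int w, int h)
{
    GLuint tex, fbo;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R32F, w, h, 0, GL_RED, GL_FLOAT, nullptr);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex, 0);
    // ... glUseProgram(expensivePassProgram); draw a fullscreen triangle ...
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    return tex;
}

// Later passes bind that texture and read it freely, including neighbours,
// which is exactly the "context" that does not exist mid-pass.
const char* laterPassFS = R"(#version 330 core
uniform sampler2D uCached;   // result of pass 1
out vec4 fragColor;
void main() {
    ivec2 sz  = textureSize(uCached, 0);
    float sum = 0.0;
    for (int y = -1; y <= 1; ++y)          // 3x3 box filter over cached values
        for (int x = -1; x <= 1; ++x) {
            ivec2 p = clamp(ivec2(gl_FragCoord.xy) + ivec2(x, y),
                            ivec2(0), sz - 1);
            sum += texelFetch(uCached, p, 0).r;
        }
    fragColor = vec4(vec3(sum / 9.0), 1.0);
})";
```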
I note from my own (naive) deductions above that vertex and geometry shaders don't really seem to come into the picture of deferred rendering, and so are probably usually confined to the first pass; this seems sensible to me, but either affirmation or negation of this, with detail, would be of interest.
P.S. I am going to leave this question open to gather good answers, so don't expect quick wins!

Nice topic. Since I'm a beginner, I would say: to avoid the unnecessary pixel/fragment-shader calculations you get when you use forward rendering.
With (multi-pass) forward rendering you have to do a pass for every light in your scene, even for pixels whose colors aren't affected by that light.
But that's just a comparison between forward rendering and deferred rendering.
As opposed to keeping everything in the same shader program, the simplest benefit I can think of is that you aren't restricted to a fixed number N of lights in your scene, since in GLSL you can use either separate light uniforms or store the lights in a uniform array. Then again, you can also do this with forward rendering, but with a lot of lights in your scene the forward-rendering pixel/fragment shader becomes too expensive.
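For what it's worth, here is a small hypothetical GLSL sketch of the "lights in a uniform array" idea (the names MAX_LIGHTS, uLights and uLightCount are made up). This is also the per-fragment loop that gets expensive in forward rendering as the light count grows.

```cpp
const char* manyLightsFS = R"(#version 330 core
#define MAX_LIGHTS 32
struct Light { vec3 position; vec3 color; };
uniform Light uLights[MAX_LIGHTS];
uniform int   uLightCount;          // actual number of lights this frame
in  vec3 vWorldPos;
in  vec3 vNormal;
out vec4 fragColor;
void main() {
    vec3 n = normalize(vNormal);
    vec3 result = vec3(0.0);
    for (int i = 0; i < uLightCount; ++i) {   // one loop, not one pass, per light
        vec3 l = normalize(uLights[i].position - vWorldPos);
        result += max(dot(n, l), 0.0) * uLights[i].color;
    }
    fragColor = vec4(result, 1.0);
})";
```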
That's all I really know so I would like to hear other theories as well.

Deferred / multi-pass approaches are used when the contents of the depth buffer (produced by rendering basic geometry in an earlier pass) are needed in order to produce complex pixel / fragment shading effects based on depth, such as:
Edge / silhouette detection
Lighting
And also application logic:
GPU picking, which requires the depth buffer for ray calculation, and uniquely-coloured / ID'ed geometries in another buffer for identification of "who" was hit.
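As an illustration of the picking point (my own sketch, not code from the answer): one common approach is to write a per-object integer ID into a second colour attachment and read back a single pixel under the cursor. The FBO setup, the glDrawBuffers call and an R32I texture for attachment 1 are assumed here.

```cpp
#include <glad/glad.h>

// Fragment shader writes normal colour to attachment 0 and its ID to 1.
const char* pickFS = R"(#version 330 core
uniform int  uObjectID;
layout(location = 0) out vec4 fragColor;
layout(location = 1) out int  fragID;
void main() {
    fragColor = vec4(1.0);     // whatever shading you normally do
    fragID    = uObjectID;
})";

int pickObject(GLuint pickFBO, int mouseX, int mouseY)
{
    glBindFramebuffer(GL_READ_FRAMEBUFFER, pickFBO);
    glReadBuffer(GL_COLOR_ATTACHMENT1);          // the R32I ID attachment
    int id = 0;
    glReadPixels(mouseX, mouseY, 1, 1, GL_RED_INTEGER, GL_INT, &id);
    // The depth value at the same pixel can be read from the depth
    // attachment (GL_DEPTH_COMPONENT) to reconstruct the hit position.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
    return id;
}
```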

Related

Is there a faster alternative to geometry shaders that can render points as a specific number of triangles?

I'm currently using OpenGL with a geometry shader to take points and convert them to triangles during rendering.
I have N lists of points; each point in the i-th list is rendered as i triangles (each point in the first list becomes one triangle, each point in the second becomes two triangles, etc.). I've tried swapping geometry shaders for each of these lists, with max_vertices set to the minimum needed for that list. With OpenGL I seemingly have no control over how this is ultimately implemented on the GPU via the geometry shader, and some drivers seem to handle it very slowly while others are very fast.
Is there any way to perform this specific task optimally, ideally taking advantage of the fact that I know the exact number of desired output triangles per element and in total? I would be happy to use some alternative to geometry shaders for this if possible. I would also be happy to try Vulkan if it can do the trick.
What you want is arbitrary amplification of geometry: taking one point primitive and producing arbitrarily many entirely separate primitives from it. And the tool GPUs have for that is geometry shaders (or just using a compute shader to generate your vertex data manually, but that's probably not faster and definitely consumes more memory).
While GS's are not known for performance, there is one way you might be able to speed up what you're doing. Since every input point in a particular draw call generates the same, known number of primitives, you can eschew having each GS invocation output more than one primitive by employing instanced rendering.
Here, you use glDrawArraysInstanced. Your VS needs to pass gl_InstanceID to the GS, which can use that to figure out which triangle to generate from the vertex. That is, instead of having a loop over n to generate n triangles, the GS only generates one triangle. But it gets called instanceCount times, and each call should generate the gl_InstanceID-th triangle.
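A minimal sketch of that setup, with made-up placeholder expansion logic (replace the fan with whatever your current GS does for triangle i); it assumes GL 3.3+:

```cpp
#include <glad/glad.h>

const char* vsSrc = R"(#version 330 core
layout(location = 0) in vec4 aPoint;
flat out int vTriIndex;                  // which triangle this instance makes
void main() { gl_Position = aPoint; vTriIndex = gl_InstanceID; }
)";

const char* gsSrc = R"(#version 330 core
layout(points) in;
layout(triangle_strip, max_vertices = 3) out;
flat in int vTriIndex[];
uniform int uTrianglesPerPoint;          // same value as the instance count
void main() {
    vec4  p  = gl_in[0].gl_Position;
    int   i  = vTriIndex[0];
    // Placeholder expansion: emit the i-th wedge of a fan around the point.
    float a0 = 6.2831853 * float(i)     / float(uTrianglesPerPoint);
    float a1 = 6.2831853 * float(i + 1) / float(uTrianglesPerPoint);
    gl_Position = p;                                           EmitVertex();
    gl_Position = p + 0.1 * vec4(cos(a0), sin(a0), 0.0, 0.0);  EmitVertex();
    gl_Position = p + 0.1 * vec4(cos(a1), sin(a1), 0.0, 0.0);  EmitVertex();
    EndPrimitive();
})";

// Host side: one instance per triangle wanted from each point of this list.
void drawList(GLuint program, GLuint vao, GLsizei pointCount, GLint trisPerPoint)
{
    glUseProgram(program);
    glUniform1i(glGetUniformLocation(program, "uTrianglesPerPoint"), trisPerPoint);
    glBindVertexArray(vao);
    glDrawArraysInstanced(GL_POINTS, 0, pointCount, trisPerPoint);
}
```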
Now, one downside of this is that the order of triangles generated will be different. In your original GS code, where each GS generates all of the triangles from a point, all of the triangles from one point will be rendered before rendering any triangles from another point. With vertex instancing, you get one triangle from all of the points, then it produces another triangle from all the points, etc. If rendering order matters to you, then this won't work.
If that's important, then you can try geometry shader instancing instead. This works similarly to vertex instancing, except that the instance count is part of the GS itself. Each GS invocation is only responsible for a single triangle, and you use gl_InvocationID to decide which triangle to generate. This ensures that all primitives from one set of GS instances will be rendered before any primitives from a different set of GS instances.
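For comparison, a sketch of the instanced-GS version (again with placeholder expansion); the invocation count is a layout qualifier, so this needs GLSL 4.00 / ARB_gpu_shader5:

```cpp
const char* instancedGS = R"(#version 400 core
layout(points, invocations = 8) in;      // 8 triangles per point, fixed at compile time
layout(triangle_strip, max_vertices = 3) out;
void main() {
    vec4  p  = gl_in[0].gl_Position;
    float a0 = 6.2831853 * float(gl_InvocationID)     / 8.0;
    float a1 = 6.2831853 * float(gl_InvocationID + 1) / 8.0;
    gl_Position = p;                                           EmitVertex();
    gl_Position = p + 0.1 * vec4(cos(a0), sin(a0), 0.0, 0.0);  EmitVertex();
    gl_Position = p + 0.1 * vec4(cos(a1), sin(a1), 0.0, 0.0);  EmitVertex();
    EndPrimitive();
})";
// The draw is now a plain glDrawArrays(GL_POINTS, 0, pointCount) per list,
// but you need one such GS (and program) per distinct triangle count.
```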
The downside is what I said: the instance count is part of the GS. Unlike instanced rendering, the number of instances is baked into the GS code itself. So you will need a separate program for every count of triangles you work with. SPIR-V specialization constants make it a bit easier on you to build those programs, but you still need to maintain (and swap) multiple programs.
Also, while instanced rendering has no limit on the number of instances, GS's do have a limit. And that limit can be as small as 32 (which is a very popular number).

OpenGL: alpha-to-coverage cross-fade

If using alpha-to-coverage without explicitly setting the samples from the shader (a hardware 4.x feature?), is the coverage mask for alpha value 'a' then guaranteed to be the bit-flip of the coverage mask for alpha value '1.f - a'?
Or in other words: if i render two objects in the same location, and the pixel alphas of the two objects sum up to 1.0, is it then guaranteed that all samples of the pixel get written to (assuming both objects fully cover the pixel)?
The reason why I ask is that I want to crossfade two objects, and during the crossfade each object should still properly depth-sort with respect to itself (without interacting with the depth values of the other object and without becoming 'see-through').
If not, how can I realize such a 'perfect' crossfade in a single render pass?
The logic for alpha-to-coverage computation is required to have the same invariance and proportionality guarantees as GL_SAMPLE_COVERAGE (which allows you to specify a floating-point coverage value applied to all fragments in a given rendering command).
However, said guarantees are not exactly specific:
It is intended that the number of 1’s in this value be proportional to the sample coverage value, with all 1’s corresponding to a value of 1.0 and all 0’s corresponding to 0.0.
Note the use of the word "intended" rather than "required". The spec is deliberately super-fuzzy on all of this.
Even the invariance is really fuzzy:
The algorithm can and probably should be different at different pixel locations. If it does differ, it should be defined relative to window, not screen, coordinates, so that rendering results are invariant with respect to window position.
Again, note the word "should". There are no actual requirements here.
So basically, the answer to all of your questions is "the OpenGL specification provides no guarantees for that".
That being said, the general thrust of your question suggests that you're trying to (ab)use multisampling to do cross-fading between two overlapping things without having to do a render-to-texture operation. That's just not going to work well, even if the standard actually guaranteed something about the alpha-to-coverage behavior.
Basically, what you're trying to do is multisample-based dither-based transparency. But like with standard dithering methods, the quality is based entirely on the number of samples. A 16x multisample buffer (which is a huge amount of multisampling) would only give you an effective 16 levels of cross-fade. This would make any kind of animated fading effect not smooth at all.
And the cost of doing 16x multisampling is going to be substantially greater than the cost of doing render-to-texture cross-fading. Both in terms of rendering time and memory overhead (16x multisample buffers are gigantic).
If not, how can I realize such a 'perfect' crossfade in a single render pass?
You can't; not in the general case. Rasterizers accumulate values, with new pixels doing math against the accumulated value of all of the prior values. You want to have an operation do math against a specific previous operation, then combine those results and blend against the rest of the previous operations.
That's simply not the kind of math a rasterizer does.
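For completeness, here is one hypothetical way the render-to-texture cross-fade could look: each object is rendered alone into its own colour texture with its own depth buffer (so it only depth-sorts against itself), and a final fullscreen pass mixes the two images. Names and layout are mine, not from the answer above.

```cpp
const char* crossfadeFS = R"(#version 330 core
uniform sampler2D uObjectA;   // object A rendered alone (RGBA, A = coverage)
uniform sampler2D uObjectB;   // object B rendered alone
uniform float     uFade;      // 0.0 = only A, 1.0 = only B
in  vec2 vUV;                 // provided by the fullscreen-triangle VS
out vec4 fragColor;
void main() {
    vec4 a = texture(uObjectA, vUV);
    vec4 b = texture(uObjectB, vUV);
    // Per-pixel cross-fade; the alpha channels carry each object's coverage
    // so the composite can still be blended over the rest of the scene.
    fragColor = mix(a, b, uFade);
})";
// Host side: two offscreen FBO passes (one per object, each with its own
// depth buffer), then a fullscreen pass with this shader blended over the
// main framebuffer.
```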

How to send my model matrix only once per model to shaders

For reference, I'm following this tutorial. Now suppose I have a little application with multiple types of model. If I understand correctly, I have to send my MVP matrix from the CPU to the GPU (in other words, to my vertex shader) for each model, because each model might have a different model matrix.
Now looking at the tutorial and this post, I understand that the call to send the matrix to my shader (glUniformMatrix4fv(myMatrixID, 1, GL_FALSE, &myModelMVP[0][0])) should be done for each frame and for each model since each time it overwrites the previous value of my MVP (the one for my last model). But, being concerned about the performance of my app, I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
I'm thinking about having a uniform for each model's MVP matrix, but I think it is not scalable, and I would also have to update all of them if my view or projection matrices changed... Is there a way to avoid sending my model matrices multiple times, and only send my view and projection matrices when they change?
These are essentially two questions: how to avoid sending data when only part of a transformation sequence changes, and how to efficiently supply per-model data which may or may not have changed since the last frame.
Transformation Sequence
For the first, you have a transformation sequence. Your positions are in model space. You then conceptually transform them into world space, then to camera/view space, then finally to clip space, where you write the position to gl_Position.
Most of these transformations are constant throughout a frame, but may change on a frame-to-frame basis. So you want to avoid changing data that doesn't strictly need to be changed.
If you want to do this, then clearly you cannot provide an "MVP" matrix. That is, you should not have a single matrix that contains the whole transformation. You should instead have a matrix that represents particular parts of the transformation.
However, you will need to do this decomposition for reasons other than performance. You cannot do many lighting operations in clip space; it is a non-linear space, which messes up lots of lighting computations. Therefore, if you're going to do lighting at all, you need a transformation that stops before clip space.
Camera/view space is the most common stopping point for lighting computations.
Now, if you use model-to-camera and camera-to-clip, then the model-to-camera matrix for every model will change when the camera changes, even if the model itself has not moved. And therefore, you may need to upload a bunch of matrices that don't strictly need to be changed.
To avoid that, you would need to use model-to-world and world-to-clip (in this case, you do your lighting in world space). The issue here is that you are exposed to the perils of world space: numerical precision may become problematic.
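A small hypothetical vertex-shader sketch of that split (uniform names are mine): the per-object model-to-world matrix is uploaded only when an object moves, while the shared world-to-clip matrix changes once per frame, and lighting inputs come out in world space.

```cpp
const char* splitVS = R"(#version 330 core
layout(location = 0) in vec3 aPosition;
layout(location = 1) in vec3 aNormal;
uniform mat4 uModelToWorld;    // per object, re-uploaded only when it moves
uniform mat3 uNormalToWorld;   // inverse-transpose of the upper 3x3
uniform mat4 uWorldToClip;     // per frame (projection * view)
out vec3 vWorldPos;
out vec3 vWorldNormal;
void main() {
    vec4 worldPos = uModelToWorld * vec4(aPosition, 1.0);
    vWorldPos     = worldPos.xyz;            // lighting done in world space
    vWorldNormal  = uNormalToWorld * aNormal;
    gl_Position   = uWorldToClip * worldPos;
})";
```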
But is there a genuine performance issue here? Obviously it somewhat depends on the hardware. However, consider that many applications have hundreds if not thousands of objects, each with matrices that change every frame. An animated character usually has over a hundred matrices just for themselves that change every frame.
So it seems unlikely that the performance cost of uploading a few matrices that could have been constant is a real-world problem.
Per-Object Storage
What you really want is to separate your storage of per-object data from the program object itself. This can be done with UBOs or SSBOs; in both cases, you're storing uniform data in buffer objects.
The former are typically smaller in size (64KB or so), while the latter are essentially unbounded in their storage (16MB minimum). Obviously, the former are typically faster to access, but SSBOs shouldn't be considered to be slow.
Each object would have a section of the buffer that gets used for per-object data. And thus, you could choose to change it or not as you see fit.
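A rough sketch of that layout, with my own names and most error handling omitted: all per-object blocks live in one UBO, each draw binds just its aligned slice with glBindBufferRange, and only the slices of objects that actually changed get re-uploaded.

```cpp
#include <glad/glad.h>

struct PerObject { float modelMatrix[16]; };   // pad to std140 as needed

GLuint createPerObjectUBO(int objectCount, GLint& strideOut)
{
    // Each slice must start at a legal UBO offset.
    GLint align = 0;
    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
    strideOut = ((GLint)sizeof(PerObject) + align - 1) / align * align;

    GLuint ubo;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, strideOut * objectCount,
                 nullptr, GL_DYNAMIC_DRAW);
    return ubo;
}

// The shader's uniform block is tied to binding point 0 once, via
// glUniformBlockBinding. Then, per draw:
void drawObject(GLuint ubo, GLint stride, int objectIndex)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/ 0, ubo,
                      (GLintptr)objectIndex * stride, sizeof(PerObject));
    // ... glDrawElements(...) for this object
}
```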
Even so, such a system does not guarantee faster performance. For example, if the implementation is still reading from that buffer from last frame when you try to change it this frame, the implementation will have to either allocate new memory or just wait until the GPU is finished. This is not a hypothetical possibility; GPU rendering for complex scenes frequently lags a frame behind the CPU.
So to avoid that, you would need to double-buffer your per-object data. But when you do that, you will have to always upload their data, even if it doesn't change. Why? Because it might have changed two frames ago, and your double buffer has old data in it.
Basically, your goal of trying to avoid uploading of sometimes-static per-model data is just as likely to harm performance as to help it.
First of all, it's likely that at least something in your scene moves. If it is the objects, then the model matrix will change from frame to frame; if it is the camera, then the view or projection matrix will change. MVP is the composition of the three, so it will change anyway, and you can't get away from updating it one way or another.
However, you may still benefit from employing some of these:
Use Uniform Buffer Objects. You can send the uniforms to the GPU only once, and then rebind the buffer that the program will read the uniforms from. So different models may use different UBOs for their parameters (like model matrix).
Use Instancing. Even if you render only one instance of every model, you can pass the model matrix as an instanced vertex attribute. It will be stored in a buffer referenced by the VAO, and so sent to the GPU only once (or only when you have to update it). On the plus side, you may now easily render multiple instances of the same model through instanced draw calls (a sketch follows these notes).
Note that it might be beneficial to separate the model, view and projection matrices. View and projection might be passed through a 'camera description' uniform buffer object updated only once per frame, then referenced by all programs. The model matrix, if it isn't changed, then stays constant within the VAO. To do proper lighting you have to separate model-view from projection anyway. It might look intimidating to work with three matrices on the GPU, but you actually don't have to, as you may switch to a quaternion-based pipeline instead, which in turn simplifies things like tangent-space interpolation.
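Here is the promised sketch of the instanced-attribute approach (function and location names are mine): a mat4 attribute occupies four consecutive attribute locations, each set to advance once per instance.

```cpp
#include <glad/glad.h>

void attachModelMatrixAttribute(GLuint vao, GLuint matrixVBO, GLuint baseLoc)
{
    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, matrixVBO);   // one mat4 per instance
    for (GLuint col = 0; col < 4; ++col) {
        glEnableVertexAttribArray(baseLoc + col);
        glVertexAttribPointer(baseLoc + col, 4, GL_FLOAT, GL_FALSE,
                              16 * sizeof(float),
                              (const void*)(sizeof(float) * 4 * col));
        glVertexAttribDivisor(baseLoc + col, 1);   // advance per instance
    }
    glBindVertexArray(0);
}
// In the vertex shader: layout(location = BASE) in mat4 aModel;
// Then glDrawElementsInstanced(...) renders one or many instances, and the
// matrix buffer is re-uploaded only when an object actually moves.
```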
Two words: Premature Optimization!
I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
The amount of data transmitted is insignificant. A 4×4 matrix of single-precision floats takes up 64 bytes. For all intents and purposes this is practically nothing. Heck, it takes more data to issue the actual drawing commands to the GPU (and usually uniform value changes are packed into the same bus transaction as the drawing commands).
I'm thinking about having a uniform for each model's MVP matrix
Then you're going to run out of uniforms. There are only so many uniform locations a GPU is required to support. You could of course use a uniform buffer object, but that's hardly the right application for one.

How should I organize a shader system with OpenGL

I was thinking about:
Having a main shader which will be applied to every object of my application; it will be used for projection, transformation, positioning, coloring, etc.
And each object could have their own extra shader for extra stuff, for example a water object definitely needs an extra shader.
But there is a problem: how would I apply 2 or more shaders to one object? Because I'll need to apply the main shader + the object's own shader.
It would be really nice if OpenGL (or Direct3D!) allowed you to have multiple shaders at each vertex / fragment / whatever stage, but alas we are stuck with existing systems.
Assume you've written a bunch of GLSL functions. Some are general-purpose for all objects, like applying the modelview transformation and copying texture coords to the next stage. Some are specific to particular classes of object, such as water or rock.
What you then write is the ubershader, a program in which the main() functions at the vertex / fragment / whatever stages do nothing much other than call all these functions. This is a template or prototype from which you generate more specialised programs.
The most common way is to use the preprocessor and lots of #ifdefs around function calls inside main(). Maybe if you compile without any #defines you get the standard transform and Gouraud shading. Add in #define WATER to get the water effect, #define DISTORT for some kind of free form deformation algorithm, both if you want free-form deformed water, #define FOG to add in a fog effect, ...
You don't even need to have more than one copy of the ubershader source, since you can generate the #define strings at runtime and pass them in as an extra source string to glShaderSource before calling glCompileShader.
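A possible shape for that, with made-up helper names: the #version line has to come first, so the generated defines are passed as a separate source string between it and the ubershader body.

```cpp
#include <glad/glad.h>
#include <string>

GLuint compileUbershader(GLenum stage, const std::string& defines,
                         const std::string& body)
{
    const char* sources[3] = {
        "#version 330 core\n",
        defines.c_str(),        // e.g. "#define WATER\n#define FOG\n"
        body.c_str()            // the ubershader source, without a #version
    };
    GLuint shader = glCreateShader(stage);
    glShaderSource(shader, 3, sources, nullptr);
    glCompileShader(shader);
    // ... check GL_COMPILE_STATUS / glGetShaderInfoLog in real code
    return shader;
}
```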
What you end up with is a lot of shader programs, one for each type of rendering. If for any reasons you'd rather have just one program throughout, you can do something similar on newer systems with GLSL subroutines.
These are basically function pointers in GLSL which you can set much like uniforms. Now your ubershader has 1, 2, ... function pointer calls in the main() functions. Your program just sets up #1 to be standard transform, #2 to be rock/water/whatever, #3 to be fog, ... If you don't want to use a stage, just have a NOP function that you can assign.
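A tiny illustrative subroutine sketch (names invented here), with one "surface" slot selected per draw from the host side; note that subroutines need GLSL 4.00, and the selection must be re-issued after each glUseProgram.

```cpp
const char* subroutineFS = R"(#version 400 core
subroutine vec3 SurfaceFunc(vec3 baseColor);
subroutine uniform SurfaceFunc uSurface;   // the "function pointer" slot

subroutine(SurfaceFunc) vec3 surfaceRock(vec3 c)  { return c * 0.8; }
subroutine(SurfaceFunc) vec3 surfaceWater(vec3 c) { return mix(c, vec3(0.0, 0.2, 0.4), 0.6); }
subroutine(SurfaceFunc) vec3 surfaceNop(vec3 c)   { return c; }

in  vec3 vColor;
out vec4 fragColor;
void main() { fragColor = vec4(uSurface(vColor), 1.0); }
)";
// Host side, per draw (indices are queried, not hard-coded):
//   GLuint idx = glGetSubroutineIndex(prog, GL_FRAGMENT_SHADER, "surfaceWater");
//   glUniformSubroutinesuiv(GL_FRAGMENT_SHADER, 1, &idx);
```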
While this has the advantage of only using one program, it is not as flexible as the #define approach, because any given pointer has to use the same function prototype. It's also more work if, say, WATER needs processing in multiple shader stages, because you have to remember to set the function pointers in every one rather than just adding a single #define.
Hope this helps.

Does the order of the pixels drawn depend on the indices in glDrawElements?

I'm drawing several alpha-blended triangles that overlap with a single glDrawElements call.
The indices list the triangles back to front and this order is important for the correct visualization.
Can I rely on the result of this operation being exactly the same as when drawing the triangles in the same order with distinct draw calls?
I'm asking this because I'm not sure whether some hardware would make some kind of an optimization and use the indices only for the information about the primitives that are drawn and disregard the actual primitive order.
To second GuyRT's answer, I looked through the GL4.4 core spec:
glDrawElements is described as follows (emphasis mine):
This command constructs a sequence of geometric primitives by
successively transferring elements for count vertices to the GL.
In section 2.1, one can find the following statement (emphasis mine):
Commands are always processed in the order in which they are received,
[...] This means, for example, that one primitive must be drawn
completely before any subsequent one can affect the framebuffer.
One might read this as only valid for primitives rendered through different draw calls (commands); however, in section 7.12.1 there is some further confirmation of the more general reading of that statement (again, my emphasis):
The relative order of invocations of the same shader type are
undefined. A store issued by a shader when working on primitive B
might complete prior to a store for primitive A, even if primitive A
is specified prior to primitive B. This applies even to fragment
shaders; while fragment shader outputs are written to the framebuffer
in primitive order, stores executed by fragment shader invocations are
not.
Yes, you can rely on the order being the same as specified in the index array, and that fragments will be correctly blended with the results of triangles specified earlier in the array.
I cannot find a reference for this, but my UI rendering code relies on this behaviour (and I think it is a common technique).
To my knowledge OpenGL makes no statement about the order of triangles rendered within a single draw call of any kind. It would be counterproductive of it to do so, because it would place undesirable constraints on implementations.
Consider that modern rendering hardware is almost always multi-processor, so the individual triangles from a draw call are almost certainly being rendered in parallel. If you need to render in a particular order for alpha blending purposes, you need to break up your geometry. Alternatively you could investigate the variety of order independent transparency algorithms out there.