Is there a way to calculate dFdx(dFdx()) of something within a fragment shader? - glsl

So I already know that the documentation for dFdx, dFdy, and fwidth states that "expressions that imply higher-order derivatives such as dFdx(dFdx(n)) have undefined results, as do mixed-order derivatives such as dFdx(dFdy(n))." If such expressions are undefined, is it possible to get higher-order derivatives of some expression within a fragment shader?
I hear that dFdx gets information from neighboring fragments and finds the difference between the neighbor's values and this fragment's values. Perhaps there is a way to manually take information from neighboring fragments?
I think there is a formula that can be used to find the mixed second-order derivative ∂²f/∂x∂y:
(f(x+h,y+h) - f(x+h,y) - f(x,y+h) + f(x,y))/h^2
But my question is: how do we get the terms f(x+h,y+h), f(x+h,y) and f(x,y+h)? And how do we get h, the distance between fragments?
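To make the formula concrete, here is a CPU-side sketch in plain C (not shader code) that evaluates it for a test function whose answer is known. It assumes h = 1, because adjacent fragment centres are one pixel apart in window coordinates; sample_f and mixed_second_derivative are hypothetical names.

```c
/* Test function with a known mixed partial: d2(x*y)/dxdy = 1 everywhere. */
double sample_f(double x, double y) {
    return x * y;
}

/* Forward-difference approximation of the mixed partial d2f/dxdy at (x, y),
   built from the four samples in the formula with spacing h. */
double mixed_second_derivative(double (*f)(double, double),
                               double x, double y, double h) {
    return (f(x + h, y + h) - f(x + h, y) - f(x, y + h) + f(x, y)) / (h * h);
}
```

For f(x, y) = x*y at (3, 5) with h = 1, the four samples are 24, 20, 18 and 15, and the formula gives exactly 1. In a shader, the open question is how to obtain the neighbouring samples, which is what the answer addresses.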

Perhaps there is a way to manually take information from neighboring fragments?
Even if you could (and with some subgroup extensions, you can), it wouldn't help.
Fragment shaders execute invocations in 2x2 quads, with groups of 4 invocations that are directly adjacent to each other. The derivative functions merely take the difference between data in the horizontal/vertical fragments in the quad. If one or more of the fragments in a quad happens to be outside of the area of the primitive being rasterized, it still gets executed (in order to compute derivatives), but it has no visible effects. These are called "helper invocations".
Regardless, invocations in a quad can only talk to other invocations in the same quad. And if you wanted to get higher order derivatives, you would need to sample from more than just a single adjacent fragment.
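A simplified CPU model in C shows why nesting dFdx inside a quad cannot recover a second derivative. It assumes fine-grained derivatives computed as the horizontal difference within each row of a 2x2 quad (actual hardware, and coarse derivatives, may behave differently); Quad, quad_dfdx and quad_from_square are hypothetical names. Because both fragments in a row receive the same difference, differencing that result again always yields zero.

```c
/* A 2x2 quad of per-fragment values, indexed [row][column]. */
typedef struct {
    float v[2][2];
} Quad;

/* Model of a fine dFdx: each fragment gets (right - left) of its own row,
   so both columns of a row end up with the same value. */
Quad quad_dfdx(Quad q) {
    Quad d;
    for (int row = 0; row < 2; ++row) {
        float diff = q.v[row][1] - q.v[row][0];
        d.v[row][0] = diff;
        d.v[row][1] = diff;
    }
    return d;
}

/* Fill a quad with f(x) = x * x sampled at pixels x0 and x0 + 1.
   The true second derivative of x * x is 2 everywhere. */
Quad quad_from_square(float x0) {
    Quad q;
    for (int row = 0; row < 2; ++row) {
        for (int col = 0; col < 2; ++col) {
            float x = x0 + (float)col;
            q.v[row][col] = x * x;
        }
    }
    return q;
}
```

In this model, quad_dfdx(quad_dfdx(quad_from_square(4.0f))) is 0 in every fragment instead of 2: a second difference along x needs three samples in a row, and a quad only has two.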


Is there a faster alternative to geometry shaders that can render points as a specific number of triangles?

I'm currently using OpenGL with a geometry shader to take points and convert them to triangles during rendering.
I have n lists of points, and each point in the k-th list is rendered as k triangles (each point in the first list becomes one triangle, each point in the second list becomes two triangles, etc.). I've tried switching geometry shaders for each of these lists, with max_vertices set to the minimum needed for each list. With OpenGL I seemingly have no control over how the geometry shader is ultimately implemented on the GPU, and some drivers seem to handle it very slowly while others are very fast.
Is there any way to perform this specific task optimally, ideally taking advantage of the fact that I know the exact number of desired output triangles per element and in total? I would be happy to use some alternative to geometry shaders for this if possible. I would also be happy to try Vulkan if it can do the trick.
What you want is arbitrary amplification of geometry: taking one point primitive and producing arbitrarily many entirely separate primitives from it. And the tool GPUs have for that is geometry shaders (or just using a compute shader to generate your vertex data manually, but that's probably not faster and definitely more memory consuming).
While GSs are not known for performance, there is one way you might be able to speed up what you're doing. Since all of the primitives in a particular call will generate a specific number of primitives, you can eschew having each GS output more than one primitive by employing vertex instanced rendering.
Here, you use glDrawArraysInstanced. Your VS needs to pass gl_InstanceID to the GS, which can use that to figure out which triangle to generate from the vertex. That is, instead of having a loop over n to generate n triangles, the GS only generates one triangle. But it gets called instanceCount times, and each call should generate the gl_InstanceIDth triangle.
Now, one downside of this is that the order of triangles generated will be different. In your original GS code, where each GS generates all of the triangles from a point, all of the triangles from one point will be rendered before rendering any triangles from another point. With vertex instancing, you get one triangle from all of the points, then it produces another triangle from all the points, etc. If rendering order matters to you, then this won't work.
If that's important, then you can try geometry shader instancing instead. This works similarly to vertex instancing, except that the instance count is part of the GS. Each GS invocation is only responsible for a single triangle, and you use gl_InvocationID to decide which triangle that invocation generates. This will ensure that all primitives from one set of GS instances will be rendered before any primitives from a different set of GS instances.
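The ordering difference between the two approaches can be modelled on the CPU. This C sketch does not call OpenGL; it only lists the (point, triangle) pairs in the order each approach produces them, following the answer's description: vertex instancing loops over instances on the outside, while GS instancing keeps all invocations for one point together. TriRef and both function names are hypothetical.

```c
/* Which point a triangle came from, and which of that point's triangles it
   is (gl_InstanceID with vertex instancing, gl_InvocationID with GS
   instancing). */
typedef struct {
    int point;
    int triangle;
} TriRef;

/* Vertex instancing (glDrawArraysInstanced): each instance covers every
   point, so triangle 0 of all points precedes triangle 1 of any point. */
void order_vertex_instancing(int points, int tris_per_point, TriRef *out) {
    int n = 0;
    for (int inst = 0; inst < tris_per_point; ++inst)
        for (int p = 0; p < points; ++p)
            out[n++] = (TriRef){p, inst};
}

/* Geometry shader instancing (layout(points, invocations = N) in;):
   all triangles of one point come before any triangle of the next point. */
void order_gs_instancing(int points, int tris_per_point, TriRef *out) {
    int n = 0;
    for (int p = 0; p < points; ++p)
        for (int inv = 0; inv < tris_per_point; ++inv)
            out[n++] = (TriRef){p, inv};
}
```

With 2 points and 2 triangles each, vertex instancing yields (0,0), (1,0), (0,1), (1,1), while GS instancing yields (0,0), (0,1), (1,0), (1,1).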
The downside is what I said: the instance count is part of the GS. Unlike instanced rendering, the number of instances is baked into the GS code itself. So you will need a separate program for every count of triangles you work with. SPIR-V specialization constants make it a bit easier on you to build those programs, but you still need to maintain (and swap) multiple programs.
Also, while instanced rendering has no limit on the number of instances, GSs do have a limit (GL_MAX_GEOMETRY_SHADER_INVOCATIONS). And that limit can be as small as 32 (which is a very popular number).

OpenGL: alpha-to-coverage cross-fade

If using alpha-to-coverage without explicitly setting the samples from the shader (a hardware 4.x feature?), is the coverage mask for alpha value 'a' then guaranteed to be the bit-flip of the coverage mask for alpha value '1.f-a'?
Or in other words: if I render two objects in the same location, and the pixel alphas of the two objects sum up to 1.0, is it then guaranteed that all samples of the pixel get written to (assuming both objects fully cover the pixel)?
The reason I ask is that I want to crossfade two objects, and during the crossfade each object should still properly depth-sort with respect to itself (without interacting with the depth values of the other object and without becoming "see-through").
If not, how can I realize such a "perfect" crossfade in a single render pass?
The logic for alpha-to-coverage computation is required to have the same invariance and proportionality guarantees as GL_SAMPLE_COVERAGE (which allows you to specify a floating-point coverage value applied to all fragments in a given rendering command).
However, said guarantees are not exactly specific:
It is intended that the number of 1’s in this value be proportional to the sample coverage value, with all 1’s corresponding to a value of 1.0 and all 0’s corresponding to 0.0.
Note the use of the word "intended" rather than "required". The spec is deliberately super-fuzzy on all of this.
Even the invariance is really fuzzy:
The algorithm can and probably should be different at different pixel locations. If it does differ, it should be defined relative to window, not screen, coordinates, so that rendering results are invariant with respect to window position.
Again, note the word "should". There are no actual requirements here.
So basically, the answer to all of your questions is "the OpenGL specification provides no guarantees for that".
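To see why the missing requirement matters, consider a hypothetical implementation that the quoted wording would still allow: it sets the lowest round(a * N) bits of an N-sample mask, so the number of 1s is proportional to alpha. This C sketch is illustrative only (no real driver is claimed to work this way, and the function names are made up); it shows that the masks for a and 1 - a can overlap instead of complementing each other.

```c
/* Hypothetical alpha-to-coverage: set the lowest round(alpha * samples)
   bits. Valid for 0 <= alpha <= 1 and 1 <= samples <= 16. */
unsigned coverage_mask(float alpha, int samples) {
    int bits = (int)(alpha * (float)samples + 0.5f);
    return (1u << bits) - 1u;
}

/* Returns 1 if two objects with alphas a and 1 - a together write every
   sample of the pixel under this hypothetical scheme, 0 otherwise. */
int crossfade_covers_pixel(float a, int samples) {
    unsigned full = (1u << samples) - 1u;
    unsigned both = coverage_mask(a, samples) | coverage_mask(1.0f - a, samples);
    return both == full;
}
```

At a = 0.5 on a 4-sample pixel, both objects get the mask 0b0011, so samples 2 and 3 are never written and the background shows through.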
That being said, the general thrust of your question suggests that you're trying to (ab)use multisampling to do cross-fading between two overlapping things without having to do a render-to-texture operation. That's just not going to work well, even if the standard actually guaranteed something about the alpha-to-coverage behavior.
Basically, what you're trying to do is multisample-based dithered transparency. But like with standard dithering methods, the quality is based entirely on the number of samples. A 16x multisample buffer (which is a huge amount of multisampling) would only give you 16 effective levels of cross-fade. This would make any kind of animated fading effect not smooth at all.
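The quantization argument is easy to check numerically. This C sketch sweeps an 8-bit alpha over 0..255 and counts how many distinct numbers of covered samples a pixel can end up with, assuming coverage is round(alpha * samples) as in a proportional scheme; distinct_coverage_levels is a hypothetical name.

```c
/* Count the distinct covered-sample counts reachable as an 8-bit alpha
   sweeps 0..255, assuming covered = round(alpha * samples).
   Supports up to 64 samples. */
int distinct_coverage_levels(int samples) {
    int seen[65] = {0};
    int levels = 0;
    for (int i = 0; i <= 255; ++i) {
        float alpha = (float)i / 255.0f;
        int covered = (int)(alpha * (float)samples + 0.5f);
        if (!seen[covered]) {
            seen[covered] = 1;
            ++levels;
        }
    }
    return levels;
}
```

For 16 samples this gives 17 states (0 through 16 covered samples), compared with 256 distinct alpha values, so a slow fade advances in visible steps.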
And the cost of doing 16x multisampling is going to be substantially greater than the cost of doing render-to-texture cross-fading, both in terms of rendering time and memory overhead (16x multisample buffers are gigantic).
If not, how can I realize such a "perfect" crossfade in a single render pass?
You can't; not in the general case. Rasterizers accumulate values, with new pixels doing math against the accumulated value of all of the prior values. You want to have an operation do math against a specific previous operation, then combine those results and blend against the rest of the previous operations.
That's simply not the kind of math a rasterizer does.
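A one-channel C sketch of that limitation: with standard alpha blending (glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)), drawing A with alpha a and then B with alpha 1 - a does not produce a*A + (1 - a)*B, because B blends against the accumulated result, background included. The function names are illustrative.

```c
/* Standard "over" blending for one channel:
   glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA). */
float blend_over(float src, float src_alpha, float dst) {
    return src * src_alpha + dst * (1.0f - src_alpha);
}

/* What two sequential draws into the framebuffer produce. */
float sequential_crossfade(float a_color, float b_color, float bg, float a) {
    float fb = blend_over(a_color, a, bg);
    return blend_over(b_color, 1.0f - a, fb);
}

/* What the crossfade should produce: a mix of A and B only. */
float ideal_crossfade(float a_color, float b_color, float a) {
    return a_color * a + b_color * (1.0f - a);
}
```

Expanding the sequential version gives a*a*A + (1 - a)*B + a*(1 - a)*bg. With both objects white over a black background at a = 0.5, the result is 0.75 instead of 1.0: a quarter of the background leaks through.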

Under what conditions does a multi-pass approach become strictly necessary?

I'd like to enumerate those general, fundamental circumstances under which multi-pass rendering becomes an unavoidable necessity, as opposed to keeping everything within the same shader program. Here's what I've come up with so far.
When a result requires non-local fragment information (i.e. context) around the current fragment, e.g. for box filters, then a previous pass must have supplied this;
When a result needs hardware interpolation done by a prior pass;
When a result acts as pre-cache of some set of calculations that enables substantially better performance than simply (re-)working through the entire set of calculations in those passes that use them, e.g. transforming each fragment of the depth buffer in a particular and costly way, which multiple later-pass shaders can then share, rather than each repeating those calculations. So, calculate once, use more than once.
I note from my own (naive) deductions above that vertex and geometry shaders don't really seem to come into the picture in deferred rendering, and so are probably usually done in the first pass; to me this seems sensible, but either affirmation or negation of this, with detail, would be of interest.
P.S. I am going to leave this question open to gather good answers, so don't expect quick wins!
Nice topic. I'm a beginner, but I would say: to avoid the unnecessary calculations in the pixel/fragment shader that you get when you use forward rendering.
With classic multi-pass forward rendering you have to do a pass for every light in your scene, even if the pixel colors aren't affected by it.
But that's just a comparison between forward rendering and deferred rendering.
As opposed to keeping everything in the same shader program, the simplest thing I can think of is that you aren't restricted to a fixed number N of lights in your scene, whereas in a single GLSL program you have to declare the lights either as separate uniforms or as a fixed-size uniform array. Then again, you can also use forward rendering, but if you have a lot of lights in your scene, forward rendering ends up with an overly expensive pixel/fragment shader.
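A back-of-the-envelope C model of that cost difference. It assumes multi-pass forward rendering shades every rasterized fragment (overdraw included) once per light, while deferred rendering writes a G-buffer once per fragment and then shades each screen pixel once per light. Real engines cull lights per object or per tile, so the function names and numbers are illustrative only.

```c
/* Lighting evaluations for multi-pass forward rendering:
   every rasterized fragment (including overdraw) is shaded once per light. */
long long forward_shading_cost(long long fragments, long long lights) {
    return fragments * lights;
}

/* Deferred rendering: one G-buffer write per fragment, then one lighting
   evaluation per screen pixel per light. */
long long deferred_shading_cost(long long fragments, long long pixels,
                                long long lights) {
    return fragments + pixels * lights;
}
```

For a 1920x1080 target (2,073,600 pixels) with an assumed overdraw of 3 (6,220,800 fragments) and 32 lights, the model gives 199,065,600 evaluations for forward versus 72,576,000 for deferred.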
That's all I really know so I would like to hear other theories as well.
Deferred / multi-pass approaches are used when the results of the depth buffer are needed (produced by rendering basic geometry) in order to produce complex pixel / fragment shading effects based on depth, such as:
Edge / silhouette detection
And also application logic:
GPU picking, which requires the depth buffer for ray calculation, and uniquely-coloured / ID'ed geometries in another buffer for identification of "who" was hit.
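Edge/silhouette detection illustrates why the depth buffer has to come from an earlier pass: each output pixel compares its depth with its neighbours' depths, which a fragment shader can only read once that depth is stored in a texture. This CPU-side C sketch of the second pass assumes a simple threshold on the horizontal and vertical depth differences; detect_depth_edges and the threshold are hypothetical.

```c
/* Second-pass edge detection over a depth buffer produced by a first pass.
   depth and edges are width*height arrays in row-major order. A pixel is
   marked as an edge when its depth differs from its right or bottom
   neighbour by more than threshold. */
void detect_depth_edges(const float *depth, int width, int height,
                        float threshold, unsigned char *edges) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float d = depth[y * width + x];
            float dx = (x + 1 < width) ? depth[y * width + x + 1] - d : 0.0f;
            float dy = (y + 1 < height) ? depth[(y + 1) * width + x] - d : 0.0f;
            if (dx < 0.0f) dx = -dx;
            if (dy < 0.0f) dy = -dy;
            edges[y * width + x] = (dx > threshold || dy > threshold) ? 1 : 0;
        }
    }
}
```

For a 3x1 depth row {0.2, 0.2, 0.9} and a threshold of 0.1, only the middle pixel is marked, because only it sees the jump to its right-hand neighbour.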

Does the order of the pixels drawn depend on the indices in glDrawElements?

I'm drawing several alpha-blended triangles that overlap with a single glDrawElements call.
The indices list the triangles back to front and this order is important for the correct visualization.
Can I rely on the result of this operation being exactly the same as when drawing the triangles in the same order with distinct draw calls?
I'm asking this because I'm not sure whether some hardware would make some kind of an optimization and use the indices only for the information about the primitives that are drawn and disregard the actual primitive order.
To second GuyRT's answer, I looked through the GL4.4 core spec:
glDrawElements is described as follows (emphasis mine):
This command constructs a sequence of geometric primitives by successively transferring elements for count vertices to the GL.
In section 2.1, one can find the following statement (emphasis mine):
Commands are always processed in the order in which they are received, [...] This means, for example, that one primitive must be drawn completely before any subsequent one can affect the framebuffer.
One might read this as applying only to primitives rendered through different draw calls (commands). However, section 7.12.1 offers further confirmation of the more general interpretation of that statement (again, my emphasis):
The relative order of invocations of the same shader type are undefined. A store issued by a shader when working on primitive B might complete prior to a store for primitive A, even if primitive A is specified prior to primitive B. This applies even to fragment shaders; while fragment shader outputs are written to the framebuffer in primitive order, stores executed by fragment shader invocations are not.
Yes, you can rely on the order being the same as specified in the index array, and that fragments will be correctly blended with the results of triangles specified earlier in the array.
I cannot find a reference for this, but my UI rendering code relies on this behaviour (and I think it is a common technique).
To my knowledge OpenGL makes no statement about the order of triangles rendered within a single draw call of any kind. It would be counterproductive of it to do so, because it would place undesirable constraints on implementations.
Consider that modern rendering hardware is almost always multi-processor, so the individual triangles from a draw call are almost certainly being rendered in parallel. If you need to render in a particular order for alpha blending purposes, you need to break up your geometry. Alternatively you could investigate the variety of order independent transparency algorithms out there.

Does OpenGL guarantee that primitives in a vertex buffer will be drawn in order?

I've been reading through the OpenGL specification trying to find an answer to this question, without luck. I'm trying to figure out if OpenGL guarantees that draw calls such as glDrawElements or glDrawArrays will draw elements in precisely the order they appear in the VBO, or if it is free to process the fragments of those primitives in any order.
For example, suppose I have a vertex buffer with 30 vertices representing 10 triangles, all with the same coordinates. Will it always be the case that the triangle corresponding to vertices 0, 1 and 2 is rendered first (and therefore ends up on the bottom), and that the triangle corresponding to vertices 27, 28 and 29 is always rendered last (and therefore ends up on top)?
The specification is very careful to define an order for the rendering of everything. Arrays of vertex data are processed in order, which results in the generation of primitives in a specific order. Each primitive is said to be rasterized in order, and later primitives cannot be rasterized until prior ones have finished.
Of course, this is all how OpenGL says it should behave. Implementations can (and do) cheat by rasterizing and processing multiple primitives at once. However, they will still obey the "as if" rule. So they cheat internally, but will still write the results as if it had all executed sequentially.
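The "as if" behaviour can be sketched in C for a single pixel: fragments may finish shading in any order, but if the blending stage applies them sorted by primitive index, the framebuffer ends up identical to strictly sequential rendering. This is only a conceptual model (the specification does not describe the hardware at this level), and Fragment and commit_in_primitive_order are hypothetical names.

```c
/* A shaded fragment for one pixel: which primitive produced it, and its output. */
typedef struct {
    int primitive;
    float color;
    float alpha;
} Fragment;

/* Standard alpha blending for one channel. */
static float blend(float src, float alpha, float dst) {
    return src * alpha + dst * (1.0f - alpha);
}

/* Commit fragments that finished shading in arbitrary order: first restore
   primitive order (insertion sort on the primitive index), then blend. */
float commit_in_primitive_order(Fragment *frags, int count, float dst) {
    for (int i = 1; i < count; ++i) {
        Fragment key = frags[i];
        int j = i - 1;
        while (j >= 0 && frags[j].primitive > key.primitive) {
            frags[j + 1] = frags[j];
            --j;
        }
        frags[j + 1] = key;
    }
    for (int i = 0; i < count; ++i)
        dst = blend(frags[i].color, frags[i].alpha, dst);
    return dst;
}
```

If the fragments complete as primitives 2, 0, 1, the commit still blends them as 0, 1, 2, so the result matches sequential rendering exactly.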
So yes, there's a specific order you can rely upon, unless you're using shaders that perform incoherent memory accesses; then all bets are off for shader writes.
Although they may actually be drawn in a different order and finish at different times, at the last raster operation pipeline stage, any blending (or depth/stencil/alpha test for that matter) will be done in the order that the triangles were issued.
You can confirm this by rendering some object using a blending equation that doesn't commute, for example:
glBlendFunc(GL_ONE, GL_DST_COLOR);
If the final framebuffer contents were written by the same arbitrary order that the primitives may be drawn, then in such an example you would see an effect that looks similar to Z-fighting.
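glBlendFunc(GL_ONE, GL_DST_COLOR) works as a test because, with the default GL_FUNC_ADD equation, it computes src + dst * dst per channel, which depends on draw order. A small C model (clamping to [0, 1] as a normalized fixed-point framebuffer would; the function names are hypothetical) shows that drawing A then B differs from drawing B then A.

```c
/* One channel of glBlendFunc(GL_ONE, GL_DST_COLOR) with GL_FUNC_ADD:
   result = src * 1 + dst * dst, clamped to [0, 1]. */
float blend_one_dst_color(float src, float dst) {
    float r = src + dst * dst;
    return r > 1.0f ? 1.0f : r;
}

/* Draw two fragments onto the same pixel in the given order. */
float draw_two(float first, float second, float clear) {
    float fb = blend_one_dst_color(first, clear);
    return blend_one_dst_color(second, fb);
}
```

Starting from 0.5, drawing 0.1 then 0.2 gives about 0.3225, while drawing 0.2 then 0.1 gives about 0.3025; if primitives were committed in arbitrary order, pixels would alternate between such values.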
This is why it's called a fragment shader (as opposed to a pixel shader): its output is not a pixel yet. After the fragment stage the result still isn't written to the framebuffer; that only happens after the raster operation stage.