EndStreamPrimitive() can only be used in a geometry shader with streams.
A geometry shader with streams can only emit GL_POINTS.
But with GL_POINTS, each vertex is itself a primitive.
So what is the point of having a function like EndStreamPrimitive()?
Just calling EmitStreamVertex() when the primitive type is GL_POINTS already ends the primitive.
My next question is: what is max_vertices in a geometry shader?
layout(points, max_vertices = 6) out;
I suppose it is the maximum number of vertices a geometry shader will emit (irrespective of whether it is using streams or not).
If I have 2 streams in my geometry shader and I emit 2 vertices to stream 0 and 3 vertices to stream 1, should the value of max_vertices be set to 5?
As far as I know, EndStreamPrimitive (...) currently has no use; it was likely provided simply for consistency with the non-stream GLSL design. If the restriction that points are the only type of output primitive is ever lifted, it may become useful.
To answer your second question: that is the maximum number of vertices you will ever emit in a single invocation of your GS. Geometry shaders are different from other stages in that the size of their output data can vary at run time. You could decide, because of some run-time condition, to output 8 vertices to stream 1; GLSL needs an upper bound, and it cannot figure one out if your flow control depends on a variable set elsewhere.
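For the stream case specifically: as far as I can tell from the spec, the declared maximum covers vertices emitted to all streams combined, so 2 + 3 = 5 in your example. A minimal sketch (the output names are made up; streams other than 0 can only be captured via transform feedback):
#version 400
layout(points) in;
layout(points, max_vertices = 5) out;

layout(stream = 0) out vec4 out0; // captured from stream 0
layout(stream = 1) out vec4 out1; // captured from stream 1

void main()
{
    // 2 vertices to stream 0
    for (int i = 0; i < 2; ++i) {
        out0 = vec4(float(i));
        EmitStreamVertex(0);
    }
    // 3 vertices to stream 1
    for (int i = 0; i < 3; ++i) {
        out1 = vec4(float(i));
        EmitStreamVertex(1);
    }
}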
There are implementation limits on both the number of vertices a GS invocation can emit (GL_MAX_GEOMETRY_OUTPUT_VERTICES) and the sum total number of vector components it can write (GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS). These limits are also exposed in the shading language.
Implementations must support the following minimums:
const int gl_MaxGeometryOutputVertices = 256;
const int gl_MaxGeometryTotalOutputComponents = 1024;
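You can query what your implementation actually supports at run time; a plain C sketch:
GLint maxVerts, maxTotalComponents;
glGetIntegerv(GL_MAX_GEOMETRY_OUTPUT_VERTICES, &maxVerts);
glGetIntegerv(GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS, &maxTotalComponents);
/* any conforming implementation reports at least 256 and 1024 here */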
If your shader exceeds those limits, you will need to split it into multiple invocations. You can either do that using multiple draw calls or you can have GL4+ automatically invoke your GS multiple times:
layout (invocations = <num_invocations>) in;
You can determine which invocation the GS is performing by the value of gl_InvocationID (GL4+).
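A minimal sketch of an instanced GS (the per-invocation offset is just for illustration):
#version 400
layout(triangles, invocations = 2) in;
layout(triangle_strip, max_vertices = 3) out;

void main()
{
    for (int i = 0; i < 3; ++i) {
        // each of the 2 invocations shifts its copy of the triangle
        gl_Position = gl_in[i].gl_Position + vec4(float(gl_InvocationID), 0.0, 0.0, 0.0);
        EmitVertex();
    }
    EndPrimitive();
}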
Related
I use an OpenGL shader to plot graphs. Every span of the graph is described by a set of coefficients, the a's and b's.
The vertex shader just passes the a's and b's to a geometry shader, which then evaluates the curve at max_vertices points.
The problem is that sometimes the geometry shader seems to become "overloaded" and stops spitting out points.
Both curves actually have the exact same values, but for some reason the bottom one has some kind of failure of the geometry shader to generate points.
When I change max_vertices in the following line of my geometry shader:
layout (triangle_strip, max_vertices = ${max_vertices}) out;
from 1024 (the result of gl.glGetInteger(gl.GL_MAX_GEOMETRY_OUTPUT_VERTICES)) to 256, I get the desired output.
What is happening? What is the true maximum number of vertices? Why is the top graph unaffected, but the bottom one corrupted? They have the same data.
Geometry shaders have two competing sets of limitations. The first is the number of vertices each GS invocation can emit (GL_MAX_GEOMETRY_OUTPUT_VERTICES), and the second is the total number of components an invocation can write across all of its output variables (GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS). You must stay within both to get defined behavior.
1024 is the required minimum value of GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS; all implementations will support at least that many. If an implementation supports only that minimum and you tried to output 1024 vertices, you could only output a single component of data for each such vertex (a single float or int, not even a vec2).
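So a safer approach is to derive max_vertices from both limits at run time; a hedged C sketch, where componentsPerVertex is whatever your GS actually writes per vertex (gl_Position alone is 4 components):
GLint maxVerts, maxTotalComponents;
glGetIntegerv(GL_MAX_GEOMETRY_OUTPUT_VERTICES, &maxVerts);
glGetIntegerv(GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS, &maxTotalComponents);

int componentsPerVertex = 4; /* adjust for your actual outputs */
int budget = maxTotalComponents / componentsPerVertex;
int maxVertices = budget < maxVerts ? budget : maxVerts;
/* substitute maxVertices into the GS source's max_vertices declaration */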
Overall, it would be better to avoid this problem entirely. What you're trying to do seems like it'd be more easily done via a compute shader or, at the very least, geometry shader instancing.
The Khronos wiki for the Tessellation Control Shader states that
The output patch size does not have to match the input patch size.
Why is that? And why do we have to specify the input patch size at all, when the control shader can change it before the primitive generator gets the patch?
Update
Is the following explanation correct?
The input patch (to the TCS) size is set by glPatchParameteri(GL_PATCH_VERTICES, X). This has the consequence that the in attribute arrays have a length of X.
TCS:
in vec4 vs_tc_position[]; // This has a length of X
The output patch size is defined by the TCS's layout (vertices = Y) out;. This means the out attribute arrays have a length of Y.
TCS:
out vec4 tc_te_position[]; // This has a length of Y
The TCS is invoked Y times (once per output vertex) and passes its output directly to the TES. So the in attribute arrays of the TES have a length of Y.
TES:
in vec4 tc_te_position[]; // This has a length of Y
The number of output patch vertices is irrelevant to the tessellation primitive generator (TPG) because it only sees an abstract patch. The shape of the abstract patch is defined by the TES's layout (TYPE) in;.
The TES is invoked for every new vertex that the abstract patch produces under the tessellation levels, which are set by the TCS (when one exists) or by glPatchParameterfv(GL_PATCH_DEFAULT_{OUTER|INNER}_LEVEL, ...). The TES can then interpolate the attributes of all the vertices from the TCS (which act more like control points), based on the gl_TessCoord from the abstract patch.
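For example, a quad-domain TES might combine the control points like this (a sketch; how you weight them is entirely up to you):
#version 400
layout(quads, equal_spacing, ccw) in;

in vec4 tc_te_position[]; // length = TCS output patch size

void main()
{
    // bilinear interpolation of the first four control points
    vec4 a = mix(tc_te_position[0], tc_te_position[1], gl_TessCoord.x);
    vec4 b = mix(tc_te_position[3], tc_te_position[2], gl_TessCoord.x);
    gl_Position = mix(a, b, gl_TessCoord.y);
}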
Example
So, the following situation could be possible.
glPatchParameteri(GL_PATCH_VERTICES, 1);
The TCS gets one vertex per patch.
layout (vertices = 5) out;
The TCS creates 5 vertices for the output patch. Somehow.
layout (quads) in;
The TPG uses a quad as the abstract patch and subdivides it. The TES is then invoked for every new vertex; it interpolates the attributes of the 5 output vertices from the TCS (somehow), using the gl_TessCoord from the abstract patch, to compute the attributes of the new vertices.
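A sketch of the "somehow" part (the offsets are made up; any derivation of the 5 output vertices from the single input is legal):
#version 400
layout(vertices = 5) out;

in vec4 vs_tc_position[];  // length 1 here (GL_PATCH_VERTICES = 1)
out vec4 tc_te_position[]; // length 5

void main()
{
    // derive each output control point from the single input vertex
    tc_te_position[gl_InvocationID] =
        vs_tc_position[0] + vec4(float(gl_InvocationID) * 0.1, 0.0, 0.0, 0.0);

    if (gl_InvocationID == 0) {
        gl_TessLevelInner[0] = 4.0;
        gl_TessLevelInner[1] = 4.0;
        gl_TessLevelOuter[0] = 4.0;
        gl_TessLevelOuter[1] = 4.0;
        gl_TessLevelOuter[2] = 4.0;
        gl_TessLevelOuter[3] = 4.0;
    }
}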
The input patch size has to be specified because you don't have to have a TCS at all.
Also, remember that the input patch size is used to interpret the stream of vertices you render with: every X vertices form a single patch, so OpenGL needs to know what X to use. Even with a TCS, OpenGL needs the input size to know how many vertices to pass to the TCS operation.
As for why the input and output patch sizes can differ: that is there to give the TCS and the user the freedom to do whatever they want. TCS's are able to add, remove, or modify data however they want, including adding or removing entire values.
So a TCS can turn a single input vertex into 4 output vertices; this could be useful for something like quad tessellation.
Shaders have invocations, each of which is (usually) given a unique set of input data and (usually) writes to its own separate output data. When you issue a rendering command, how many times does each shader get invoked?
Each shader stage has its own frequency of invocations. I will use the OpenGL terminology, but D3D works the same way (since they're both modelling the same hardware relationships).
Vertex Shaders
These are the second most complicated. They execute once for every input vertex... kinda. If you are using non-indexed rendering, then the ratio is exactly 1:1. Every input vertex will execute on a separate vertex shader instance.
If you are using indexed rendering, then it gets complicated. It's more-or-less 1:1, each vertex having its own VS invocation. However, thanks to post-T&L caching, it is possible for a vertex shader to be executed less than once per input vertex.
See, a vertex shader's execution is assumed to create a 1:1 mapping between input vertex data and output vertex data. This means if you pass identical input data to a vertex shader (in the same rendering command), your VS is expected to generate identical output data. So if the hardware can detect that it is about to execute a vertex shader on the same input data that it has used previously, it can skip that execution and simply use the outputs from the previous execution. Assuming it has those values lying around, such as in a cache.
Hardware detects this by using the vertex's index (which is why it doesn't work for non-indexed rendering). If the same index is provided to a vertex shader, it is assumed that the shader will get all of the same input values, and therefore will generate the same output values. So the hardware will cache output values based on indices. If an index is in the post-T&L cache, then the hardware will skip the VS's execution and just use the output values.
Instancing only slightly complicates post-T&L caching. Rather than caching solely on the vertex index, it caches based on the index and instance ID. So it only uses the cached data if both values are the same.
So generally, VS's execute once for every vertex, but if you optimize your geometry with indexed data, it can execute fewer times. Sometimes much fewer, depending on how you do it.
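For example, drawing a quad as two indexed triangles lets the hardware reuse the two shared corners (assuming the index data has already been uploaded to the bound GL_ELEMENT_ARRAY_BUFFER):
/* indices { 0, 1, 2,  0, 2, 3 }: vertices 0 and 2 are referenced twice,
   so their VS results can be served from the post-T&L cache */
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, (void*)0);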
Tessellation Control Shaders
Or Hull Shaders in D3D parlance.
The TCS is very simple in this regard. It will execute exactly once for each vertex of the output patch, for each patch in the rendering command. No caching or other optimizations are done here.
Tessellation Evaluation Shaders
Or Domain Shaders in D3D parlance.
The TES executes after the tessellation primitive generator has generated new vertices. Because of that, how frequently it executes will obviously depend on your tessellation parameters.
The TES takes vertices generated by the tessellator and outputs vertices. It does so in a 1:1 ratio.
But, similar to vertex shaders, it is not necessarily one invocation per vertex of each output primitive. Like a VS, the TES is assumed to provide a direct 1:1 mapping from locations in the tessellated patch to output parameters. So if the TES is invoked multiple times with the same patch location, it is expected to output the same value.
As such, if generated primitives share vertices, the TES will often only be invoked once for such shared vertices. Unlike vertex shaders, you have no control over how much the hardware will utilize this. The best you can do is hope that the generation algorithm is smart enough to minimize how often it calls the TES.
Geometry Shaders
A Geometry Shader will be invoked once for each point, line or triangle primitive, either directly given by the rendering command or generated by the tessellator. So if you render 6 vertices as unconnected lines, your GS will be invoked exactly 3 times.
Each GS invocation can generate zero or more primitives as output.
The GS can use instancing internally (in OpenGL 4.0 or Direct3D 11). This means that, for each primitive that reaches the GS, the GS will be invoked X times, where X is the number of GS instances. Each such invocation will get the same input primitive data (with a special input value used to distinguish between such instances). This is useful for more efficiently directing primitives to different layers of layered framebuffers.
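A sketch of that layered-rendering pattern (assumes a layered framebuffer such as a texture array is bound; the per-layer matrices are hypothetical):
#version 400
layout(triangles, invocations = 6) in; // e.g. one invocation per cube face
layout(triangle_strip, max_vertices = 3) out;

uniform mat4 viewProj[6]; // hypothetical per-layer transforms

void main()
{
    for (int i = 0; i < 3; ++i) {
        gl_Layer    = gl_InvocationID; // route this copy to one layer
        gl_Position = viewProj[gl_InvocationID] * gl_in[i].gl_Position;
        EmitVertex();
    }
    EndPrimitive();
}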
Fragment Shaders
Or Pixel Shaders in D3D parlance. Even though they aren't pixels yet, may never become pixels, and can be executed multiple times for the same pixel ;)
These are the most complicated with regard to invocation frequency. How often they execute depends on a lot of things.
FS's must be executed at least once for each pixel-sized area that a primitive rasterizes to. But they may be executed more than that.
In order to compute derivatives for texture functions, one FS invocation will often borrow values from one of its neighboring invocations. This is problematic if there is no such invocation, for example if the neighbor falls outside the boundary of the primitive being rasterized.
In such cases, there will still be a neighboring FS invocation. Even though it produces no actual data, it still exists and still does work. The good part is that these helper invocations don't hurt performance: they're basically using up shader resources that would otherwise have gone unused. Also, any attempt by such helper invocations to actually output data will be ignored by the system.
But they do still technically exist.
A less transparent issue revolves around multisampling. See, multisampling implementations (particularly in OpenGL) are allowed to decide on their own how many FS invocations to issue. While there are ways to force multisampled rendering to create an FS invocation for every sample, there is no guarantee that implementations will execute the FS only once per covered pixel outside of these cases.
For example, if I recall correctly, if you create a multisample image with a high sample count on certain NVIDIA hardware (8 to 16 or something like that), then the hardware may decide to execute the FS multiple times. Not necessarily once per sample, but once for every 4 samples or so.
So how many FS invocations do you get? At least one for every pixel-sized area covered by the primitive being rasterized. Possibly more if you're doing multisampled rendering.
Compute Shaders
The exact number of invocations that you specify. That is, the number of work groups you dispatch times the number of invocations per group specified by your CS (its local work-group size). No more, no less.
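For example, this compute shader (a trivial sketch) runs exactly 10 * 64 = 640 invocations when dispatched with glDispatchCompute(10, 1, 1):
#version 430
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Out { uint data[]; };

void main()
{
    // one invocation per element: 10 groups * 64 invocations each
    data[gl_GlobalInvocationID.x] = gl_GlobalInvocationID.x;
}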
I really do not understand how the fragment shader works.
I know that
the vertex shader runs once per vertex
the fragment shader runs once per fragment
Since the fragment shader works per fragment rather than per vertex, how can the vertex shader send data to it? The numbers of vertices and fragments are not equal.
How is it decided which fragments belong to which vertex?
To make sense of this, you'll need to consider the whole render pipeline. The outputs of the vertex shader (besides the special output gl_Position) are passed along as "associated data" of the vertex to the next stages in the pipeline.
While the vertex shader works on a single vertex at a time, not caring about primitives at all, further stages of the pipeline do take the primitive type (and the vertex connectivity info) into account. That's what is typically called "primitive assembly". Now we still have the single vertices with the associated data produced by the VS, but we also know which vertices are grouped together to define a basic primitive like a point (1 vertex), a line (2 vertices) or a triangle (3 vertices).
During rasterization, fragments are generated for every pixel location in the output raster that belongs to the primitive. In doing so, the associated data of the vertices defining the primitive is interpolated across the whole primitive. For a line, this is rather simple: a linear interpolation is done. Call the endpoints A and B, each with some associated output vector v, so that we have v_A and v_B. Along the line, we get the interpolated value v(x) = (1-x) * v_A + x * v_B, where x ranges from 0 (at point A) to 1 (at point B). For a triangle, barycentric interpolation between the data of all 3 vertices is used. So while there is no 1:1 mapping between vertices and fragments, the outputs of the VS still define the values of the corresponding inputs of the FS, just not directly, but indirectly via interpolation across the primitive.
The formulas I have given so far are a bit simplified. Actually, by default, a perspective correction is applied, effectively modifying the formulas so that the distortion effects of the perspective projection are taken into account. This simply means that the interpolation should act as if it were applied linearly in object space (before the distortion of the projection). For example, with a perspective projection and a primitive that is not parallel to the image plane, moving 1 pixel to the right in screen space means moving a variable distance on the real object, depending on how far the actual point is from the camera plane.
You can disable the perspective correction by using the noperspective qualifier for the in/out variables in GLSL. Then, the linear/barycentric interpolation is used as I described it.
You can also use the flat qualifier, which disables the interpolation entirely. In that case, the value of just one vertex (the so-called "provoking vertex") is used for all fragments of the whole primitive. Integer data can never be automatically interpolated by the GL and has to be qualified as flat when sent to the fragment shader.
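In GLSL that looks like the following (a minimal sketch; the variable names are made up for illustration):
// vertex shader outputs
out vec4 smoothColor;               // default: perspective-correct interpolation
noperspective out vec4 screenColor; // interpolated linearly in screen space
flat out int materialId;            // not interpolated; provoking vertex's value

// matching fragment shader inputs
in vec4 smoothColor;
noperspective in vec4 screenColor;
flat in int materialId;             // integers must be flat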
The answer is that they don't -- at least not directly. There's an additional stage called "the rasterizer" that sits between the vertex processor and the fragment processor in the pipeline. The rasterizer is responsible for collecting the vertices that come out of the vertex shader, reassembling them into primitives (usually triangles), breaking those triangles up into "rasters" of (partially) covered pixels, and sending these fragments to the fragment shader.
This is a (mostly) fixed-function piece of hardware that you don't program directly. There are some configuration tweaks you can make that affect what it treats as a primitive and what it produces as fragments, but for the most part it's just there between the vertex shader and fragment shader, doing its thing.
I am trying to control the behavior of the fragment shader by counting vertices in the geometry shader, so that if I have a vertex stream of 1000 triangles, when the count reaches 500 I set some varying for the fragment shader, which signals that the latter must switch its processing. To count the total vertices (or triangles) processed, I use an atomic counter in the geometry shader. I planned to do it in the vertex shader first, but then I read somewhere that because of vertex caching the counter won't increment on each vertex invocation. But now it seems that doing it in the geometry shader doesn't count precisely either.
In my geometry shader I am doing this:
layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;
layout(binding = 0, offset = 0) uniform atomic_uint ac;

uniform uint exteriorSize; // number of vertices in the array (see below)

flat out float isExterior;

void main()
{
    memoryBarrier();
    uint counter = atomicCounter(ac);
    float switcher = 0.0;
    if (counter >= exteriorSize)
    {
        switcher = 2.0;
    }
    else
    {
        atomicCounterIncrement(ac);
        atomicCounterIncrement(ac);
        atomicCounterIncrement(ac);
    }
    isExterior = switcher;
    // here just emitting primitive....
}
exteriorSize is a uniform holding a number equal to the number of vertices in an array. When I read back the value of the counter on the CPU, it never equals exteriorSize; it is almost 2 times smaller. Is there vertex caching in the geometry stage as well? Or am I doing something wrong?
Basically, what I need is to tell the fragment shader: "after vertex number X, start doing work Y; as long as the vertex number is less than X, do work Z". And I can't get that exact X from the atomic counter, even though I increment it until it reaches that limit.
UPDATE:
I suspect the problem is with synchronization of the atomic writes. If I put memoryBarrier in different places, the counter values change. But I still can't get it to return a value that exactly equals exteriorSize.
UPDATE 2:
Well, I didn't figure out the issue with atomic counter synchronization, so I did it using indirect draw instead. Works like a charm.
The geometry shader executes per-primitive (triangle in this case), whereas the vertex shader executes per-vertex, almost. Using glDrawElements allows vertex results to be shared between triangles (e.g. indexing 0,1,2 then 0,2,3 uses 0 and 2 twice: 4 verts, 2 triangles and 6 references). As you say, a limited cache is used to share the results, so if the same vertex is referenced a long time later it has to be recomputed.
It looks like there's a potential race: the counter can be updated by other invocations between your atomicCounter read and the atomicCounterIncrement calls. If you want an entire section of code like this to behave atomically, it needs to be locked, and that can get very slow depending on what you're locking.
Instead, it's going to be far easier to always call atomicCounterIncrement and potentially allow ac to grow beyond exteriorSize.
AFAIK reading back values from the atomic counter buffer should stall until the memory operations have completed, but I've been caught out not calling glMemoryBarrier between passes before.
It sounds like exteriorSize should equal the number of triangles, not vertices, if this is executing in the geometry shader. If you do want per-vertex processing instead, then maybe change to GL_POINTS, or save the vertex shader results using transform feedback and then draw triangles from that (essentially doing the caching yourself, but with a buffer that holds everything). If you use glDrawArrays or never reuse vertices, then a standard vertex shader should be fine.
Lastly, calling atomicCounterIncrement three times is a waste: call it once and use counter * 3.
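Putting that together, a sketch of the simplified GS body (keeping the question's ac, exteriorSize and isExterior names, and assuming exteriorSize counts vertices):
// one increment per primitive; the value returned is this primitive's
// unique ordinal, with no read-then-increment race
uint primitiveIndex = atomicCounterIncrement(ac);
isExterior = (primitiveIndex * 3u >= exteriorSize) ? 2.0 : 0.0;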