OpenGL GLSL: Output from shader to memory - opengl

I need to have some variable/object in the graphics memory that can be accessed within my fragment shader and within my normal C# code. Preferably a 16Byte vec4
What I want to do:
[In C#] Read variable from graphic memory to cpu memory
[In C#] Set variable to zero
[In C#] Execute normal drawing of my scene
[In Shader] One of the fragment passes writes something to the variable (UPDATE)
Restart the loop
(UPDATE)
I pass the current mouse coordinates to the fragment shader with uniform variables. The fragment shader then checks if it is the corresponding pixel. If yes it writes a certain color for colorpicking into the variable. The reason I dont write to a FS output is that I simply didn't find any solution on the internet on how to get this output into my normal memory. Additionaly i would have an output for each pixel instead of one
What I want is basically a uniform variable that a shader can write to.
Is there any kind of variable/object that fits my needs and if so how performant will it be?

A "uniform" that your shader can write to is the wrong term. Uniform means uniform (as in the same value everywhere). If a specific shader invocation is changing the value, it is not uniform anymore.
You can use atomic counters for this sort of thing; increment the counter for every test that passes and later check for non-zero state. This is vastly simpler than setting up a general-purpose Shader Storage Buffer Object and then worrying about making memory access to it coherent.
Occlusion queries are also available for older hardware. They work surprisingly similarly to atomic counters, where you can (very roughly) count the number of fragments that pass a depth test. Do not count on its accuracy, use discard in your fragment shader for any pixel that does not pass your test condition and then test for a non-zero fragment count in the query readback.
As for performance, as long as you can deal with a couple frames worth of latency between issuing a command and later using the result, you should be fine.
If you try to use either, an atomic counter or an occlusion query and read-back the result during the same frame, you will stall the pipeline and eliminate CPU/GPU parallelism.
I would suggest inserting a fence sync object in the command stream and then checking the status of the fence once per-frame before attempting to read results back. This will prevent stalling.

Related

Persistent data in a shader - GLSL

Is it possible to have persistent data within the same shader? Something like a uniform, except that the shader itself can set it.
What I'm trying to do is embed a vertex in my vertex stream that will act as a "state change" and affect an operation on all vertices downstream. I.E. the vertex shader says "ah, I hit this vertex" and turns a boolean on or off... and later vertices can check that boolean. It doesn't need to work between shaders, it just needs to work during a single pump of vertices.
Is this possible? Using GLSL.
embed a vertex in my vertex stream that will act as a "state change" and affect an operation on all vertices downstream.
That's not possible. Invocations of a particular shader stage execute in a largely undefined order with respect to other invocations within that stage (and drawing command). Regardless of the ability to read or write memory, you cannot ensure that "downstream" invocations will not have been executed before the "upstream" invocations wrote to that memory.
If you want to do something like that, you will need to put them into separate drawing commands as well as issue appropriate memory barriers (to make such writes visible to later commands ) between such draws.

Frequency of shader invocations in rendering commands

Shaders have invocations, which each are (usually) given a unique set of input data, and each (usually) write to their own separate output data. When you issue a rendering command, how many times does each shader get invoked?
Each shader stage has its own frequency of invocations. I will use the OpenGL terminology, but D3D works the same way (since they're both modelling the same hardware relationships).
Vertex Shaders
These are the second most complicated. They execute once for every input vertex... kinda. If you are using non-indexed rendering, then the ratio is exactly 1:1. Every input vertex will execute on a separate vertex shader instance.
If you are using indexed rendering, then it gets complicated. It's more-or-less 1:1, each vertex having its own VS invocation. However, thanks to post-T&L caching, it is possible for a vertex shader to be executed less than once per input vertex.
See, a vertex shader's execution is assumed to create a 1:1 mapping between input vertex data and output vertex data. This means if you pass identical input data to a vertex shader (in the same rendering command), your VS is expected to generate identical output data. So if the hardware can detect that it is about to execute a vertex shader on the same input data that it has used previously, it can skip that execution and simply use the outputs from the previous execution. Assuming it has those values lying around, such as in a cache.
Hardware detects this by using the vertex's index (which is why it doesn't work for non-indexed rendering). If the same index is provided to a vertex shader, it is assumed that the shader will get all of the same input values, and therefore will generate the same output values. So the hardware will cache output values based on indices. If an index is in the post-T&L cache, then the hardware will skip the VS's execution and just use the output values.
Instancing only slightly complicates post-T&L caching. Rather than caching solely on the vertex index, it caches based on the index and instance ID. So it only uses the cached data if both values are the same.
So generally, VS's execute once for every vertex, but if you optimize your geometry with indexed data, it can execute fewer times. Sometimes much fewer, depending on how you do it.
Tessellation Control Shaders
Or Hull Shaders in D3D parlance.
The TCS is very simple in this regard. It will execute exactly once for each vertex in each patch of the rendering command. No caching or other optimizations are done here.
Tessellation Evaluation Shaders
Or Domain Shaders in D3D parlance.
The TES executes after the tessellation primitive generator has generated new vertices. Because of that, how frequently it executes will obviously depend on your tessellation parameters.
The TES takes vertices generated by the tessellator and outputs vertices. It does so in a 1:1 ratio.
But similar to Vertex Shaders, it is not necessarly 1:1 for each vertex in each of the output primitives. Like a VS, the TES is assumed to provide a direct 1:1 mapping between locations in the tessellated primitives and output parameters. So if you invoke a TES multiple times with the same patch location, it is expected to output the same value.
As such, if generated primitives share vertices, the TES will often only be invoked once for such shared vertices. Unlike vertex shaders, you have no control over how much the hardware will utilize this. The best you can do is hope that the generation algorithm is smart enough to minimize how often it calls the TES.
Geometry Shaders
A Geometry Shader will be invoked once for each point, line or triangle primitive, either directly given by the rendering command or generated by the tessellator. So if you render 6 vertices as unconnected lines, your GS will be invoked exactly 3 times.
Each GS invocation can generate zero or more primitives as output.
The GS can use instancing internally (in OpenGL 4.0 or Direct3D 11). This means that, for each primitive that reaches the GS, the GS will be invoked X times, where X is the number of GS instances. Each such invocation will get the same input primitive data (with a special input value used to distinguish between such instances). This is useful for more efficiently directing primitives to different layers of layered framebuffers.
Fragment Shaders
Or Pixel Shaders in D3D parlance. Even though they aren't pixels yet, may not become pixels, and they can be executed multiple times for the same pixel ;)
These are the most complicated with regard to invocation frequency. How often they execute depends on a lot of things.
FS's must be executed at least once for each pixel-sized area that a primitive rasterizes to. But they may be executed more than that.
In order to compute derivatives for texture functions, one FS invocation will often borrow values from one of its neighboring invocation. This is problematic if there is no such invocation, if a neighbor falls outside of the boundary of the primitive being rasterized.
In such cases, there will still be a neighboring FS invocation. Even though it produces no actual data, it still exists and still does work. The good part is that these helper invocations don't hurt performance. They're basically using up shader resources that would have otherwise gone unusued. Also, any attempt by such helper invocations to actually output data will be ignored by the system.
But they do still technically exist.
A less transparent issue revolves around multisampling. See, multisampling implementations (particularly in OpenGL) are allowed to decide on their own how many FS invocations to issue. While there are ways to force multisampled rendering to create an FS invocation for every sample, there is no guarantee that implementations will execute the FS only once per covered pixel outside of these cases.
For example, if I recall correctly, if you create a multisample image with a high sample count on certain NVIDIA hardware (8 to 16 or something like that), then the hardware may decide to execute the FS multiple times. Not necessarily once per sample, but once for every 4 samples or so.
So how many FS invocations do you get? At least one for every pixel-sized area covered by the primitive being rasterized. Possibly more if you're doing multisampled rendering.
Compute Shaders
The exact number of invocations that you specify. That is, the number of work groups you dispatch * the number of invocations per group specified by your CS (your local group count). No more, no less.

Tessellation Shaders

I am trying to learn tessellation shaders in openGL 4.1. I understood most of the things. I have one question.
What is gl_InvocationID?
Can any body please explain in some easy way?
gl_InvocationID has two current uses, but it represents the same concept in both.
In Geometry Shaders, you can have GL run your geometry shader multiple times per-primitive. This is useful in scenarios where you want to draw the same thing from several perspectives. Each time the shader runs on the same set of data, gl_InvocationID is incremented.
The common theme between Geometry and Tessellation Shaders is that each invocation shares the same input data. A Tessellation Control Shader can read every single vertex in the input patch primitive, and you actually need gl_InvocationID to make sense of which data point you are supposed to be processing.
This is why you generally see Tessellation Control Shaders written something like this:
gl_out [gl_InvocationID].gl_Position = gl_in [gl_InvocationID].gl_Position;
gl_in and gl_out are potentially very large arrays in Tessellation Control Shaders (equal in size to GL_PATCH_VERTICES), and you have to know which vertex you are interested in.
Also, keep in mind that you are not allowed to write to any index other than gl_out [gl_InvocationID] from a Tessellation Control Shader. That property keeps invoking Tessellation Control Shaders in parallel sane (it avoids order dependencies and prevents overwriting data that a different invocation already wrote).

How Many Shader Programs Do I Really Need?

Let's say I have a shader set up to use 3 textures, and that I need to render some polygon that needs all the same shader attributes except that it requires only 1 texture. I have noticed on my own graphics card that I can simply call glDisableVertexAttrib() to disable the other two textures, and that doing so apparently causes the disabled texture data received by the fragment shader to be all white (1.0f). In other words, if I have a fragment shader instruction (pseudo-code)...
final_red = tex0.red * tex1.red * tex2.red
...the operation produces the desired final value regardless whether I have 1, 2, or 3 textures enabled. From this comes a number of questions:
Is it legit to disable expected textures like this, or is it a coincidence that my particular graphics card has this apparent mathematical safeguard?
Is the "best practice" to create a separate shader program that only expects a single texture for single texture rendering?
If either approach is valid, is there a benefit to creating a second shader program? I'm thinking it would be cost less time to make 2 glDisableVertexAttrib() calls than to make a glUseProgram() + 5-6 glGetUniform() calls, but maybe #4 addresses that issue.
When changing the active shader program with glUseProgram() do I need to call glGetUniform... functions every time to re-establish the location of each uniform in the program, or is the location of each expected to be consistent until the shader program is deallocated?
Disabling vertex attributes would not really disable your textures, it would just give you undefined texture coordinates. That might produce an affect similar to disabling a certain texture, but to do this properly you should use a uniform or possibly subroutines (if you have dozens of variations of the same shader).
As far as time taken to disable a vertex array state, that's probably going to be slower than changing a uniform value. Setting uniform values don't really affect the render pipeline state, they're just small changes to memory. Likewise, constantly swapping the current GLSL program does things like invalidate shader cache, so that's also significantly more expensive than setting a uniform value.
If you're on a modern GL implementation (GL 4.1+ or one that implements GL_ARB_separate_shader_objects) you can even set uniform values without binding a GLSL program at all, simply by calling glProgramUniform* (...)
I am most concerned with the fact that you think you need to call glGetUniformLocation (...) each time you set a uniform's value. The only time the location of a uniform in a GLSL program changes is when you link it. Assuming you don't constantly re-link your GLSL program, you only need to query those locations once and store them persistently.

Is it possible to write a bunch of pixels in gl_FragColor?

Has anyone familiar with some sort of OpenGL magic to get rid of calculating bunch of pixels in fragment shader instead of only 1? Especially this issue is hot for OpenGL ES in fact meanwile flaws mobile platforms and necessary of doing things in more accurate (in performance meaning) way on it.
Are any conclusions or ideas out there?
P.S. it's known shader due to GPU architecture organisation is run in parallel for each texture monad. But maybe there techniques to raise it from one pixel to a group of ones or to implement your own glTexture organisation. A lot of work could be done faster this way within GPU.
OpenGL does not support writing to multiple fragments (meaning with distinct coordinates) in a shader, for good reason, it would obstruct the GPUs ability to compute each fragment in parallel, which is its greatest strength.
The structure of shaders may appear weird at first because an entire program is written for only one vertex or fragment. You might wonder why can't you "see" what is going on in neighboring parts?
The reason is an instance of the shader program runs for each output fragment, on each core/thread simultaneously, so they must all be independent of one another.
Parallel, independent, processing allows GPUs to render quickly, because the total time to process a batch of pixels is only as long as the single most intensive pixel.
Adding outputs with differing coordinates greatly complicates this.
Suppose a single fragment was written to by two or more instances of a shader.
To ensure correct results, the GPU can either assign one to be an authority and ignore the other (how does it know which will write?)
Or you can add a mutex, and have one wait around for the other to finish.
The other option is to allow a race condition regarding whichever one finishes first.
Either way this would immensely slows down the process, make the shaders ugly, and introduces incorrect and unpredictable behaviour.
Well firstly you can calculate multiple outputs from a single fragment shader in OpenGL 3 and up. A framebuffer object can have more than one RGBA surfaces (Renderbuffer Objects) attached and generate an RGBA for each of them by using gl_FragData[n] instead of gl_FragColor. See chapter 8 of the 5th edition OpenGL SuperBible.
However, the multiple outputs can only be generated for the same X,Y pixel coordinates in each buffer. This is for the same reason that an older style fragment shader can only generate one output, and can't change gl_FragCoord. OpenGL guarantees that in rendering any primitive, one and only one fragment shader will write to any X,Y pixel in the destination framebuffer(s).
If a fragment shader could generate multiple pixel values at different X,Y coords, it might try to write to the same destination pixel as another execution of the same fragment shader. Same if the fragment shader could change the pixel X or Y. This is the classic multiple threads trying to update shared memory problem.
One way to solve it would be to say "if this happens, the results are unpredictable" which sucks from the programmer point of view because it's completely out of your control. Or fragment shaders would have to lock the pixels they are updating, which would make GPUs far more complicated and expensive, and the performance would suck. Or fragment shaders would execute in some defined order (eg top left to bottom right) instead of in parallel, which wouldn't need locks but the performance would suck even more.