Is it possible to write a bunch of pixels in gl_FragColor? - opengl

Has anyone familiar with some sort of OpenGL magic to get rid of calculating bunch of pixels in fragment shader instead of only 1? Especially this issue is hot for OpenGL ES in fact meanwile flaws mobile platforms and necessary of doing things in more accurate (in performance meaning) way on it.
Are any conclusions or ideas out there?
P.S. it's known shader due to GPU architecture organisation is run in parallel for each texture monad. But maybe there techniques to raise it from one pixel to a group of ones or to implement your own glTexture organisation. A lot of work could be done faster this way within GPU.

OpenGL does not support writing to multiple fragments (meaning with distinct coordinates) in a shader, for good reason, it would obstruct the GPUs ability to compute each fragment in parallel, which is its greatest strength.
The structure of shaders may appear weird at first because an entire program is written for only one vertex or fragment. You might wonder why can't you "see" what is going on in neighboring parts?
The reason is an instance of the shader program runs for each output fragment, on each core/thread simultaneously, so they must all be independent of one another.
Parallel, independent, processing allows GPUs to render quickly, because the total time to process a batch of pixels is only as long as the single most intensive pixel.
Adding outputs with differing coordinates greatly complicates this.
Suppose a single fragment was written to by two or more instances of a shader.
To ensure correct results, the GPU can either assign one to be an authority and ignore the other (how does it know which will write?)
Or you can add a mutex, and have one wait around for the other to finish.
The other option is to allow a race condition regarding whichever one finishes first.
Either way this would immensely slows down the process, make the shaders ugly, and introduces incorrect and unpredictable behaviour.

Well firstly you can calculate multiple outputs from a single fragment shader in OpenGL 3 and up. A framebuffer object can have more than one RGBA surfaces (Renderbuffer Objects) attached and generate an RGBA for each of them by using gl_FragData[n] instead of gl_FragColor. See chapter 8 of the 5th edition OpenGL SuperBible.
However, the multiple outputs can only be generated for the same X,Y pixel coordinates in each buffer. This is for the same reason that an older style fragment shader can only generate one output, and can't change gl_FragCoord. OpenGL guarantees that in rendering any primitive, one and only one fragment shader will write to any X,Y pixel in the destination framebuffer(s).
If a fragment shader could generate multiple pixel values at different X,Y coords, it might try to write to the same destination pixel as another execution of the same fragment shader. Same if the fragment shader could change the pixel X or Y. This is the classic multiple threads trying to update shared memory problem.
One way to solve it would be to say "if this happens, the results are unpredictable" which sucks from the programmer point of view because it's completely out of your control. Or fragment shaders would have to lock the pixels they are updating, which would make GPUs far more complicated and expensive, and the performance would suck. Or fragment shaders would execute in some defined order (eg top left to bottom right) instead of in parallel, which wouldn't need locks but the performance would suck even more.

Related

Frequency of shader invocations in rendering commands

Shaders have invocations, which each are (usually) given a unique set of input data, and each (usually) write to their own separate output data. When you issue a rendering command, how many times does each shader get invoked?
Each shader stage has its own frequency of invocations. I will use the OpenGL terminology, but D3D works the same way (since they're both modelling the same hardware relationships).
Vertex Shaders
These are the second most complicated. They execute once for every input vertex... kinda. If you are using non-indexed rendering, then the ratio is exactly 1:1. Every input vertex will execute on a separate vertex shader instance.
If you are using indexed rendering, then it gets complicated. It's more-or-less 1:1, each vertex having its own VS invocation. However, thanks to post-T&L caching, it is possible for a vertex shader to be executed less than once per input vertex.
See, a vertex shader's execution is assumed to create a 1:1 mapping between input vertex data and output vertex data. This means if you pass identical input data to a vertex shader (in the same rendering command), your VS is expected to generate identical output data. So if the hardware can detect that it is about to execute a vertex shader on the same input data that it has used previously, it can skip that execution and simply use the outputs from the previous execution. Assuming it has those values lying around, such as in a cache.
Hardware detects this by using the vertex's index (which is why it doesn't work for non-indexed rendering). If the same index is provided to a vertex shader, it is assumed that the shader will get all of the same input values, and therefore will generate the same output values. So the hardware will cache output values based on indices. If an index is in the post-T&L cache, then the hardware will skip the VS's execution and just use the output values.
Instancing only slightly complicates post-T&L caching. Rather than caching solely on the vertex index, it caches based on the index and instance ID. So it only uses the cached data if both values are the same.
So generally, VS's execute once for every vertex, but if you optimize your geometry with indexed data, it can execute fewer times. Sometimes much fewer, depending on how you do it.
Tessellation Control Shaders
Or Hull Shaders in D3D parlance.
The TCS is very simple in this regard. It will execute exactly once for each vertex in each patch of the rendering command. No caching or other optimizations are done here.
Tessellation Evaluation Shaders
Or Domain Shaders in D3D parlance.
The TES executes after the tessellation primitive generator has generated new vertices. Because of that, how frequently it executes will obviously depend on your tessellation parameters.
The TES takes vertices generated by the tessellator and outputs vertices. It does so in a 1:1 ratio.
But similar to Vertex Shaders, it is not necessarly 1:1 for each vertex in each of the output primitives. Like a VS, the TES is assumed to provide a direct 1:1 mapping between locations in the tessellated primitives and output parameters. So if you invoke a TES multiple times with the same patch location, it is expected to output the same value.
As such, if generated primitives share vertices, the TES will often only be invoked once for such shared vertices. Unlike vertex shaders, you have no control over how much the hardware will utilize this. The best you can do is hope that the generation algorithm is smart enough to minimize how often it calls the TES.
Geometry Shaders
A Geometry Shader will be invoked once for each point, line or triangle primitive, either directly given by the rendering command or generated by the tessellator. So if you render 6 vertices as unconnected lines, your GS will be invoked exactly 3 times.
Each GS invocation can generate zero or more primitives as output.
The GS can use instancing internally (in OpenGL 4.0 or Direct3D 11). This means that, for each primitive that reaches the GS, the GS will be invoked X times, where X is the number of GS instances. Each such invocation will get the same input primitive data (with a special input value used to distinguish between such instances). This is useful for more efficiently directing primitives to different layers of layered framebuffers.
Fragment Shaders
Or Pixel Shaders in D3D parlance. Even though they aren't pixels yet, may not become pixels, and they can be executed multiple times for the same pixel ;)
These are the most complicated with regard to invocation frequency. How often they execute depends on a lot of things.
FS's must be executed at least once for each pixel-sized area that a primitive rasterizes to. But they may be executed more than that.
In order to compute derivatives for texture functions, one FS invocation will often borrow values from one of its neighboring invocation. This is problematic if there is no such invocation, if a neighbor falls outside of the boundary of the primitive being rasterized.
In such cases, there will still be a neighboring FS invocation. Even though it produces no actual data, it still exists and still does work. The good part is that these helper invocations don't hurt performance. They're basically using up shader resources that would have otherwise gone unusued. Also, any attempt by such helper invocations to actually output data will be ignored by the system.
But they do still technically exist.
A less transparent issue revolves around multisampling. See, multisampling implementations (particularly in OpenGL) are allowed to decide on their own how many FS invocations to issue. While there are ways to force multisampled rendering to create an FS invocation for every sample, there is no guarantee that implementations will execute the FS only once per covered pixel outside of these cases.
For example, if I recall correctly, if you create a multisample image with a high sample count on certain NVIDIA hardware (8 to 16 or something like that), then the hardware may decide to execute the FS multiple times. Not necessarily once per sample, but once for every 4 samples or so.
So how many FS invocations do you get? At least one for every pixel-sized area covered by the primitive being rasterized. Possibly more if you're doing multisampled rendering.
Compute Shaders
The exact number of invocations that you specify. That is, the number of work groups you dispatch * the number of invocations per group specified by your CS (your local group count). No more, no less.

Get results of GPU calculations back to the CPU program in OpenGL

Is there a way to get results from a shader running on a GPU back to the program running on the CPU?
I want to generate a polygon mesh from simple voxel data based on a computational costly algorithm on the GPU but I need the result on the CPU for physics calculations.
Define "the results"?
In general, if you're doing GPGPU-style computations with OpenGL, you are going to need to structure your shaders around the needs of a rendering system. Rendering systems are designed to be one-way: data goes into them and an image is produced. Going backwards, having the rendering system produce data, is not generally how rendering systems are structured.
That doesn't mean you can't do it, of course. But you need to architect everything around the limitations of OpenGL.
OpenGL offers a number of hooks where you can write data from certain shader stages. Most of these require specialized hardware
Fragment shader outputs
Any hardware capable of fragment shaders will obviously allow you to write to the current framebuffer you're rendering. Through the use of framebuffer objects and textures with floating-point or integer image formats, you can write pretty much any data you want to a variety of images. Once in a texture, you can simply call glGetTexImage to get the rendered pixel data. Or you can just do glReadPixels to get it if the FBO is still bound. Either way works.
The primary limitations of this method are:
The number of images you can attach to the framebuffer; this limits the amount of data you can write. On pre-GL 3.x hardware, FBOs were typically limited to only 4 images plus a depth/stencil buffer. In 3.x and better hardware, you can expect a minimum of 8 images.
The fact that you're rendering. This means that you need to set up your vertex data to position a triangle exactly where you want it to modify data. This is not a trivial undertaking. It's also difficult to get useful input data, since you typically want each texel to be fairly independent of the other. Structuring your fragment shader around these limitations is difficult. Not impossible, but non-trivial in many cases.
Transform Feedback
This OpenGL 3.0 feature allows the output from the Vertex Processing stage of OpenGL (vertex shader and optional geometry shader) to be captured in one or more buffer objects.
This is much more natural for capturing vertex data that you want to play with or render again. In your case, you'll need to read it back after rendering it, perhaps with a glGetBufferSubData call, or by using glMapBufferRange for reading.
The limitations here are that you generally only can capture 4 output values, where each value is a vec4. There are also some strict layout restrictions. Some OpenGL 3.x and 4.x hardware offers the ability to write data to multiple feedback streams, which can all be written into different buffers.
Image Load/Store
This GL 4.2 feature represents the pinnacle of writing: you can bind an image (a buffer texture, if you want to write to a buffer), and just write to it. There are memory ordering constraints that you need to work within.
It's very flexible, but very complex. Besides the difficulty in using it properly, there are a number of limitations. The number of images you can write to will be fairly limited, perhaps 8 or so. And implementations may have total write limits, so that 8 images to write to may have to be shared by the fragment shader's outputs.
What's more, image outputs are only guaranteed for the fragment shader (and 4.3's compute shaders). That is, hardware is allowed to forbid you from using image load/store on non-FS/CS shader stages.

Is it possible to reuse glsl vertex shader output later?

I have a huge mesh(100k triangles) that needs to be drawn a few times and blend together every frame. Is it possible to reuse the vertex shader output of the first pass of mesh, and skip the vertex stage on later passes? I am hoping to save some cost on the vertex pipeline and rasterization.
Targeted OpenGL 3.0, can use features like transform feedback.
I'll answer your basic question first, then answer your real question.
Yes, you can store the output of vertex transformation for later use. This is called Transform Feedback. It requires OpenGL 3.x-class hardware or better (aka: DX10-hardware).
The way it works is in two stages. First, you have to set your program up to have feedback-based varyings. You do this with glTransformFeedbackVaryings. This must be done before linking the program, in a similar way to things like glBindAttribLocation.
Once that's done, you need to bind buffers (given how you set up your transform feedback varyings) to GL_TRANSFORM_FEEDBACK_BUFFER with glBindBufferRange, thus setting up which buffers the data are written into. Then you start your feedback operation with glBeginTransformFeedback and proceed as normal. You can use a primitive query object to get the number of primitives written (so that you can draw it later with glDrawArrays), or if you have 4.x-class hardware (or AMD 3.x hardware, all of which supports ARB_transform_feedback2), you can render without querying the number of primitives. That would save time.
Now for your actual question: it's probably not going to help buy you any real performance.
You're drawing terrain. And terrain doesn't really get any transformation. Typically you have a matrix multiplication or two, possibly with normals (though if you're rendering for shadow maps, you don't even have that). That's it.
Odds are very good that if you shove 100,000 vertices down the GPU with such a simple shader, you've probably saturated the GPU's ability to render them all. You'll likely bottleneck on primitive assembly/setup, and that's not getting any faster.
So you're probably not going to get much out of this. Feedback is generally used for either generating triangle data for later use (effectively pseudo-compute shaders), or for preserving the results from complex transformations like matrix palette skinning with dual-quaternions and so forth. A simple matrix multiply-and-go will barely be a blip on the radar.
You can try it if you like. But odds are you won't have any problems. Generally, the best solution is to employ some form of deferred rendering, so that you only have to render an object once + X for every shadow it casts (where X is determined by the shadow mapping algorithm). And since shadow maps require different transforms, you wouldn't gain anything from feedback anyway.

Web-GL : Multiple Fragment Shaders per Program

Does anyone know if it's possible to have multiple fragment shaders run serially in a single Web-GL "program"? I'm trying to replicate some code I have written in WPF using shader Effects. In the WPF program I would wrap an image with multiple borders and each border would have an Effect attached to it (allowing for multiple Effects to run serially on the same image).
I'm afraid you're probably going to have to clarify your question a bit, but I'll take a stab at answering anyway:
WebGL can support, effectively, as many different shaders as you want. (There are of course practical limits like available memory but you'd have to be trying pretty hard to bump into them by creating too many shaders.) In fact, most "real world" WebGL/OpenGL applications will use a combination of many different shaders to produce the final scene rendered to your screen. (A simple example: Water will usually be rendered with a different shader or set of shaders than the rest of the environment around it.)
When dispatching render commands only one shader program may be active at a time. The currently active program is specified by calling gl.useProgram(shaderProgram); after which any geometry drawn will be rendered with that program. If you want to render an effect that requires multiple different shaders you will need to group them by shader and draw each batch separately:
gl.useProgram(shader1);
// Setup shader1 uniforms, bind the appropriate buffers, etc.
gl.drawElements(gl.TRIANGLES, shader1VertexCount, gl.UNSIGNED_SHORT, 0); // Draw geometry that uses shader1
gl.useProgram(shader2);
// Setup shader2 uniforms, bind the appropriate buffers, etc.
gl.drawElements(gl.TRIANGLES, shader2VertexCount, gl.UNSIGNED_SHORT, 0); // Draw geometry that uses shader2
// And so on...
The other answers are on the right track. You'd either need to create the shader on the fly that applies all the effects in one shader or framebuffers and apply the effects one at a time. There's an example of the later here
WebGL Image Processing Continued
As Toji suggested, you might want to clarify your question. If I understand you correctly, you want to apply a set of post-processing effects to an image.
The simple answer to your question is: No, you can't use multiple fragment shaders with one vertex shader.
However, there are two ways to accomplish this: First, you can write everything in one fragment shader and combine them in the end. This depends on the effects you want to have!
Second, you can write multiple shader programs (one for each effect) and write your results to a fragment buffer object (render to texture). Each shader would get the results of the previous effect and apply the next one. This would be a bit more complicated, but it is the most flexible approach.
If you mean to run several shaders in a single render pass, like so (example pulled from thin air):
Vertex color
Texture
Lighting
Shadow
...each stage attached to a single WebGLProgram object, and each stage with its own main() function, then no, GLSL doesn't work this way.
GLSL works more like C/C++, where you have a single global main() function that acts as your program's entry point, and any arbitrary number of libraries attached to it. The four examples above could each be a separate "library," compiled on its own but linked together into a single program, and invoked by a single main() function, but they may not each define their own main() function, because such definitions in GLSL are shared across the entire program.
This unfortunately requires you to write separate main() functions (at a minimum) for every shader you intend to use, which leads to a lot of redundant programming, even if you plan to reuse the libraries themselves. That's why I ended up writing a glorified string mangler to manage my GLSL libraries for Jax; I'm not sure how useful the code will be outside of my framework, but you are certainly free to take a look at it, and make use of anything you find helpful. The relevant files are:
lib/assets/javascripts/jax/shader.js.coffee
lib/assets/javascripts/jax/shader/program.js.coffee
spec/javascripts/jax/shader_spec.js.coffee (tests and usage examples)
spec/javascripts/jax/shader/program_spec.js.coffee (more tests and usage examples)
Good luck!

Self-Referencing Renderbuffers in OpenGL

I have some OpenGL code that behaves inconsistently across different
hardware. I've got some code that:
Creates a render buffer and binds a texture to its color buffer (Texture A)
Sets this render buffer as active, and adjusts the viewport, etc
Activates a pixel shader (gaussian blur, in this instance).
Draws a quad to full screen, with texture A on it.
Unbinds the renderbuffer, etc.
On my development machine this works fine, and has the intended
effect of blurring the texture "in place", however on other hardware
this does not seem to work.
I've gotten it down to two possibilities.
A) Making a renderbuffer render to itself is not supposed to work, and
only works on my development machine due to some sort of fluke.
Or
B) This approach should work, but something else is going wrong.
Any ideas? Honestly I have had a hard time finding specifics about this issue.
A) is the correct answer. Rendering into the same buffer while reading from it is undefined. It might work, it might not - which is exactly what is happening.
In OpenGL's case, framebuffer_object extension has section "4.4.3 Rendering When an Image of a Bound Texture Object is Also Attached to the Framebuffer" which tells what happens (basically, undefined). In Direct3D9, the debug runtime complains loudly if you use that setup (but it might work depending on hardware/driver). In D3D10 the runtime always unbinds the target that is used as destination, I think.
Why this is undefined? One of the reasons GPUs are so fast is that they can make a lot of assumptions. For example, they can assume that units that fetch pixels do not need to communicate with units that write pixels. So a surface can be read, N cycles later the read is completed, N cycles later the pixel shader ends it's execution, then it it put into some output merge buffers on the GPU, and finally at some point it is written to memory. On top of that, the GPUs rasterize in "undefined" order (one GPU might rasterize in rows, another in some cache-friendly order, another in totally another order), so you don't know which portions of the surface will be written to first.
So what you should do is create several buffers. In blur/glow case, two is usually enough - render into first, then read & blur that while writing into second. Repeat this process if needed in ping-pong way.
In some special cases, even the backbuffer might be enough. You simply don't do a glClear, and what you have drawn previously is still there. The caveat is, of course, that you can't really read from the backbuffer. But for effects like fading in and out, this works.