Can't access fbo attached texture in GLSL compute shader with gimage2D - opengl

I recently wanted to work on a compute shader for OpenGL. In this experiment, I wanted to access one of the color textures attached to a FrameBufferObject. When attempting to pass the texture to the compute shader with a layout(rgba32f) readonly image2D, nothing was passed in. I rewrote the compute shader to use a sampler2D instead. The sampler worked just fine.
I also tested the gimage2D compute shader with a plain texture that wasn't attached to anything. This also worked as expected.
I haven't found any documentation stating that a texture attached to an FBO can't be accessed in a compute shader using gimage2D. I also haven't found any documentation stating that a compute shader can't write to an FBO.
I guess my question is why can't a texture, attached to an FBO, be accessed, in a compute shader, using gimage2D? Is there documentation explaining this?

First, in regards to your statement:
"I guess my question is why can't a texture, attached to an FBO, be accessed, in a compute shader, using gimage2D?"
You don't use gimage2D. If you see a type prefixed with g in GLSL documentation, it is a generic type (e.g. gvec<N>, gsampler..., etc.). It means that the function has overloads for every kind of vec<N> or sampler.... In this case, gimage2D is the short way of saying "... this function accepts image2D, iimage2D or uimage2D".
There is no actual gimage2D type; the g prefix was invented solely for the purpose of keeping GLSL documentation short and readable ;)
I think you already know this, because the only actual code listed in the question is using image2D, but the way things were written I was not sure.
As for your actual question, you should look into memory barriers.
Pay special attention to: GL_FRAMEBUFFER_BARRIER_BIT.
Compute Shaders are scheduled separately from stages of the render pipeline; they have their own single-stage pipeline. This means that if you draw something into an FBO attachment, your compute shader may run before you even start drawing, or the compute shader may use an (invalid) cached view of the data because the change made in the render pipeline was not visible to the compute pipeline. Memory barriers will help to synchronize the render pipeline and compute pipeline for resources that are shared between both.
The render pipeline has a lot of implicit synchronization and multi-stage data flow that gives a pretty straightforward sequential ordering for shaders (e.g. glDraw* initiates vertex->geometry->fragment), but the compute pipeline does away with virtually all of this in favor of explicit synchronization. There are all sorts of hazards that you need to consider with compute shaders and image load/store that you do not with traditional vertex/geometry/tessellation/fragment.
In other words, while declaring something coherent in a compute shader together with an appropriate barrier at the shader level will take care of synchronization between compute shader invocations, since the compute pipeline is separate from the render pipeline it does nothing to synchronize image load/store between a compute shader and a fragment shader. For that, you need glMemoryBarrier (...) to synchronize access to the memory resource at the command level. glDraw* (...) (entry-point for the render pipeline) is a separate command from glDispatch* (...) (entry-point for the compute pipeline) and you need to ensure these separate commands are ordered properly for image load/store to exhibit consistent behavior.
Without a memory barrier, there is no guarantee about the order commands are executed in; only that they produce results consistent with the order you issued them. In the render pipeline, which has strictly defined input/output for each shader stage, GL implementations can intelligently re-order commands while maintaining this property with relative ease. With compute shaders as well as image load/store in general, where the I/O is completely determined by run-time flow, it is impossible without some help (memory barriers).
TL;DR: The reason why it works if you use a sampler and not image load/store comes down to coherency guarantees (or the lack thereof). Image load/store simply does not guarantee that reads from an image are coherent (strictly ordered) with respect to anything that writes to an image, and instead requires you to explicitly synchronize access to the image. This is actually beneficial as it allows you to simultaneously read/write the same image without leading to undefined behavior, but it requires some extra effort on your part to make it work.


Frequency of shader invocations in rendering commands

Shaders have invocations, which each are (usually) given a unique set of input data, and each (usually) write to their own separate output data. When you issue a rendering command, how many times does each shader get invoked?
Each shader stage has its own frequency of invocations. I will use the OpenGL terminology, but D3D works the same way (since they're both modelling the same hardware relationships).
Vertex Shaders
These are the second most complicated. They execute once for every input vertex... kinda. If you are using non-indexed rendering, then the ratio is exactly 1:1. Every input vertex will execute on a separate vertex shader instance.
If you are using indexed rendering, then it gets complicated. It's more-or-less 1:1, each vertex having its own VS invocation. However, thanks to post-T&L caching, it is possible for a vertex shader to be executed less than once per input vertex.
See, a vertex shader's execution is assumed to create a 1:1 mapping between input vertex data and output vertex data. This means if you pass identical input data to a vertex shader (in the same rendering command), your VS is expected to generate identical output data. So if the hardware can detect that it is about to execute a vertex shader on the same input data that it has used previously, it can skip that execution and simply use the outputs from the previous execution. Assuming it has those values lying around, such as in a cache.
Hardware detects this by using the vertex's index (which is why it doesn't work for non-indexed rendering). If the same index is provided to a vertex shader, it is assumed that the shader will get all of the same input values, and therefore will generate the same output values. So the hardware will cache output values based on indices. If an index is in the post-T&L cache, then the hardware will skip the VS's execution and just use the output values.
Instancing only slightly complicates post-T&L caching. Rather than caching solely on the vertex index, it caches based on the index and instance ID. So it only uses the cached data if both values are the same.
So generally, VS's execute once for every vertex, but if you optimize your geometry with indexed data, it can execute fewer times. Sometimes much fewer, depending on how you do it.
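To make the caching effect concrete, here is a minimal sketch that counts vertex shader invocations for an indexed draw. It assumes a simple FIFO cache keyed only on the vertex index; real hardware caches differ in size and replacement policy, and the function name and `cacheSize` parameter are made up for illustration.

```javascript
// Count vertex shader invocations for an indexed draw, assuming a simple
// FIFO post-T&L cache keyed on the vertex index (a simplified model only).
function countVertexShaderInvocations(indices, cacheSize) {
  const cache = []; // oldest entry first
  let invocations = 0;
  for (const index of indices) {
    if (cache.includes(index)) {
      continue; // cache hit: the stored outputs are reused, no VS run
    }
    invocations += 1; // cache miss: the VS runs for this index
    cache.push(index);
    if (cache.length > cacheSize) {
      cache.shift(); // evict the oldest entry
    }
  }
  return invocations;
}

// Two triangles forming a quad, sharing the edge between vertices 1 and 2.
const quadIndices = [0, 1, 2, 2, 1, 3];
console.log(countVertexShaderInvocations(quadIndices, 16)); // 4
console.log(countVertexShaderInvocations(quadIndices, 0));  // 6 (no cache)
```

With the cache disabled, the count falls back to one invocation per index, which matches the non-indexed 1:1 case described above.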
Tessellation Control Shaders
Or Hull Shaders in D3D parlance.
The TCS is very simple in this regard. It will execute exactly once for each vertex in each patch of the rendering command. No caching or other optimizations are done here.
Tessellation Evaluation Shaders
Or Domain Shaders in D3D parlance.
The TES executes after the tessellation primitive generator has generated new vertices. Because of that, how frequently it executes will obviously depend on your tessellation parameters.
The TES takes vertices generated by the tessellator and outputs vertices. It does so in a 1:1 ratio.
But similar to Vertex Shaders, it is not necessarily 1:1 for each vertex in each of the output primitives. Like a VS, the TES is assumed to provide a direct 1:1 mapping between locations in the tessellated primitives and output parameters. So if you invoke a TES multiple times with the same patch location, it is expected to output the same value.
As such, if generated primitives share vertices, the TES will often only be invoked once for such shared vertices. Unlike vertex shaders, you have no control over how much the hardware will utilize this. The best you can do is hope that the generation algorithm is smart enough to minimize how often it calls the TES.
Geometry Shaders
A Geometry Shader will be invoked once for each point, line or triangle primitive, either directly given by the rendering command or generated by the tessellator. So if you render 6 vertices as unconnected lines, your GS will be invoked exactly 3 times.
Each GS invocation can generate zero or more primitives as output.
The GS can use instancing internally (in OpenGL 4.0 or Direct3D 11). This means that, for each primitive that reaches the GS, the GS will be invoked X times, where X is the number of GS instances. Each such invocation will get the same input primitive data (with a special input value used to distinguish between such instances). This is useful for more efficiently directing primitives to different layers of layered framebuffers.
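As a rough sketch of the arithmetic above, the following counts GS invocations for a draw without tessellation. The mode names are plain strings chosen for this example (not API enums), and `gsInstances` stands for the instance count declared in the geometry shader.

```javascript
// Number of base primitives assembled from vertexCount vertices.
// Mode names are illustrative strings, not real API constants.
function primitiveCount(mode, vertexCount) {
  switch (mode) {
    case "points": return vertexCount;
    case "lines": return Math.floor(vertexCount / 2);
    case "line_strip": return Math.max(vertexCount - 1, 0);
    case "triangles": return Math.floor(vertexCount / 3);
    case "triangle_strip": return Math.max(vertexCount - 2, 0);
    default: throw new Error("unsupported mode: " + mode);
  }
}

// One GS invocation per primitive, multiplied by the number of GS instances.
function geometryShaderInvocations(mode, vertexCount, gsInstances = 1) {
  return primitiveCount(mode, vertexCount) * gsInstances;
}

console.log(geometryShaderInvocations("lines", 6));    // 3
console.log(geometryShaderInvocations("lines", 6, 4)); // 12
```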
Fragment Shaders
Or Pixel Shaders in D3D parlance. Even though they aren't pixels yet, may not become pixels, and they can be executed multiple times for the same pixel ;)
These are the most complicated with regard to invocation frequency. How often they execute depends on a lot of things.
FS's must be executed at least once for each pixel-sized area that a primitive rasterizes to. But they may be executed more than that.
In order to compute derivatives for texture functions, one FS invocation will often borrow values from one of its neighboring invocations. This is problematic if there is no such invocation, that is, if a neighbor falls outside of the boundary of the primitive being rasterized.
In such cases, there will still be a neighboring FS invocation. Even though it produces no actual data, it still exists and still does work. The good part is that these helper invocations don't hurt performance. They're basically using up shader resources that would have otherwise gone unused. Also, any attempt by such helper invocations to actually output data will be ignored by the system.
But they do still technically exist.
A less transparent issue revolves around multisampling. See, multisampling implementations (particularly in OpenGL) are allowed to decide on their own how many FS invocations to issue. While there are ways to force multisampled rendering to create an FS invocation for every sample, there is no guarantee that implementations will execute the FS only once per covered pixel outside of these cases.
For example, if I recall correctly, if you create a multisample image with a high sample count on certain NVIDIA hardware (8 to 16 or something like that), then the hardware may decide to execute the FS multiple times. Not necessarily once per sample, but once for every 4 samples or so.
So how many FS invocations do you get? At least one for every pixel-sized area covered by the primitive being rasterized. Possibly more if you're doing multisampled rendering.
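A small sketch of how helper invocations can add up, without multisampling. It assumes the hardware shades fragments in aligned 2x2 blocks, which is a common arrangement but is not stated above and is not guaranteed by the API. The function name is hypothetical, and the input is assumed to be a list of unique, non-negative pixel coordinates.

```javascript
// Estimate FS invocations for a set of covered pixels, ASSUMING the
// hardware always shades complete, aligned 2x2 pixel blocks.
function fragmentInvocations(coveredPixels) {
  const quads = new Set();
  for (const [x, y] of coveredPixels) {
    quads.add(`${x >> 1},${y >> 1}`); // which 2x2 block the pixel falls in
  }
  const total = quads.size * 4; // every touched block runs 4 invocations
  return { total, helpers: total - coveredPixels.length };
}

// Three pixels in a row touch two blocks: 8 invocations, 5 of them helpers.
console.log(fragmentInvocations([[0, 0], [1, 0], [2, 0]]));
```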
Compute Shaders
The exact number of invocations that you specify. That is, the number of work groups you dispatch * the number of invocations per group specified by your CS (its local size). No more, no less.
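The multiplication can be written out as a tiny helper. The function name and the array-based parameters are made up for this example; the three-component group count and local size mirror the arguments of a dispatch call and the `local_size_x/y/z` layout qualifiers.

```javascript
// Total compute shader invocations = work groups dispatched * local size.
function computeInvocations(numGroups, localSize) {
  const groups = numGroups[0] * numGroups[1] * numGroups[2];
  const perGroup = localSize[0] * localSize[1] * localSize[2];
  return groups * perGroup;
}

// Dispatching 8 x 8 x 1 groups of a shader declared with
// layout(local_size_x = 16, local_size_y = 16) in;
console.log(computeInvocations([8, 8, 1], [16, 16, 1])); // 16384
```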

OpenGL state redundancy elimination Tree, render state priorities

I am working on an automatic OpenGL batching method in my game engine, to reduce draw calls and redundant calls.
My batch tree design begins with the most expensive states and adds leaves further down for each less expensive state.
Tree Root: Shaders / Programs
Siblings: Blend states ...
So my question is what are most likely the most expensive calls, in this list:
binding program
binding textures
binding buffers
buffering texture, vertex data
binding render targets
glEnable / glDisable
blend state equation, color, functions, colorWriteMask
depth stencil state depthFunc, stencilOperations, stencilFunction, writeMasks
Also wondering which method will be faster:
- Collect all batchable draw commands to single vertex buffer and call only 1 draw call (this method would also force to update matrix transforms per vertex on cpu side)
- Do not batch at all and render many small draw calls, only batch particle system ...
PS: Render Targets will always Pre or Post changed, depending on usage.
Progress so far:
Andon M. Coleman: Cheapest Uniform & Vertex Array Binding, Expensive FBO, Texture Bindings
datenwolf: Programs invalidate State Cache
1: Framebuffer states
2: Program
3: Texture Binding
N: Vertex Array binding, Uniform binding
Current execution Tree in WebGL:
Attribute Pointers
Blend State
Depth State
Stencil Front / Back State
Rasterizer State
Sampler State
Bind Buffer
Draw Arrays
Each step is a sibling hash tree, to avoid checking against the state cache inside of the main render queue.
Loading Textures / Programs / Shaders / Buffers happens before rendering in an extra queue, for future multi threading and also to be sure that the context is initialized before doing anything with it.
The biggest problem of self rendering objects is that you cannot control when something happens, for example if a developer calls these methods before gl is initialized, he wouldn't know why but he would have some bugs or problems...
The relative costs of such operations will of course depend on the usage pattern and your general scenario. But you might find Nvidia's "Beyond Porting" presentation slides a useful guide. Let me reproduce slide 48 in particular here:
Relative Cost of state changes
In decreasing cost...
Render Target ~60K/s
Program ~300K/s
Texture Bindings ~1.5M/s
Vertex Format
UBO Bindings
Uniform Updates ~10M/s
This does not directly match all of the bullet points of your list. E.g. glEnable/glDisable might affect anything. Also GL's buffer bindings are nothing the GPU directly sees. Buffer bindings are mainly a client side state, depending on the target, of course. Change of blending state would be a ROP state change, and so on.
This tends to be highly platform/vendor dependent. Any numbers you may find apply to a specific GPU, platform and driver version. And there are a lot of myths floating around on the internet about this topic. If you really want to know, you need to write some benchmarks, and run them across a range of platforms.
With all these caveats:
Render target (FBO) switching tends to be quite expensive. Highly platform and architecture dependent, though. For example if you have some form of tile based architecture, pending rendering that would ideally be deferred until the end of the frame may have to be completed and flushed out. Or on more "classic" architectures, there might be compressed color buffers or buffers used for early depth testing that need consideration when render targets are switched.
Updating texture or buffer data is impossible to evaluate in general terms. It obviously depends heavily on how much data is being updated. Contrary to some claims on the internet, calls like glBufferSubData() and glTexSubImage2D() do not typically cause a synchronization. But they involve data copies.
Binding programs should not be terribly expensive, but is typically still more heavyweight than the state changes below.
Texture binding is mostly relatively cheap. But it really depends on the circumstances. For example, if you use a GPU that has VRAM, and the texture is not in VRAM at the moment, it might trigger a copy of the texture data from system memory to VRAM.
Uniform updates. This is supposedly very fast on some platforms. But it's actually moderately expensive on others. So there's a lot of variability here.
Vertex state setup (including VBO and VAO binding) is typically fast. It has to be, because it's done so frequently by most apps that it can very quickly become a bottleneck. But there are similar considerations as for textures, where buffer memory may have to be copied/mapped if it was not used recently.
General state updates, like blend states, stencil state, or write masks, are generally very fast. But there can be very substantial exceptions.
Just a typical example of why characteristics can be so different between architectures: If you change blend state, that might be sending a couple of command words on one architecture, with minimal overhead. On other architectures, blending is done as part of the fragment shader. So if you change blend state, the shader program has to be modified to patch in the code for the new blending calculation.
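One way to apply the rough cost ordering quoted above is to sort draw commands so that the most expensive state changes least often. This is only a sketch: the field names (`renderTarget`, `program`, `texture`, `vertexFormat`) and the integer IDs are invented for the example, and whether this ordering pays off on a given platform still needs to be benchmarked.

```javascript
// Most expensive state first, following the rough ordering quoted above.
const STATE_ORDER = ["renderTarget", "program", "texture", "vertexFormat"];

// Compare two draw commands field by field in cost order.
function compareDraws(a, b) {
  for (const key of STATE_ORDER) {
    if (a[key] !== b[key]) return a[key] < b[key] ? -1 : 1;
  }
  return 0;
}

// Count how often each state would have to be (re)bound for a draw list.
function countStateChanges(draws) {
  const changes = Object.fromEntries(STATE_ORDER.map((k) => [k, 0]));
  let prev = null;
  for (const draw of draws) {
    for (const key of STATE_ORDER) {
      if (prev === null || prev[key] !== draw[key]) changes[key] += 1;
    }
    prev = draw;
  }
  return changes;
}

const draws = [
  { renderTarget: 0, program: 2, texture: 5, vertexFormat: 0 },
  { renderTarget: 0, program: 1, texture: 7, vertexFormat: 0 },
  { renderTarget: 0, program: 2, texture: 7, vertexFormat: 0 },
  { renderTarget: 0, program: 1, texture: 5, vertexFormat: 0 },
];
const sortedDraws = [...draws].sort(compareDraws);

console.log(countStateChanges(draws).program);       // 4
console.log(countStateChanges(sortedDraws).program); // 2
```

Note that sorting by the expensive states can increase the number of changes to cheaper states further down the key, which is the trade-off the tree design accepts.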

Is it possible to write a bunch of pixels in gl_FragColor?

Is anyone familiar with some sort of OpenGL magic for calculating a bunch of pixels in a fragment shader instead of only one? This is an especially hot issue for OpenGL ES, given the limitations of mobile platforms and the need to do things there in a more careful (performance-wise) way.
Are any conclusions or ideas out there?
P.S. I know that, because of how the GPU architecture is organised, a shader is run in parallel for each individual output element. But maybe there are techniques to raise that from one pixel to a group of pixels, or to implement your own texture organisation. A lot of work could be done faster on the GPU this way.
OpenGL does not support writing to multiple fragments (meaning with distinct coordinates) in a shader, and for good reason: it would obstruct the GPU's ability to compute each fragment in parallel, which is its greatest strength.
The structure of shaders may appear weird at first because an entire program is written for only one vertex or fragment. You might wonder why can't you "see" what is going on in neighboring parts?
The reason is that an instance of the shader program runs for each output fragment, on each core/thread simultaneously, so they must all be independent of one another.
Parallel, independent, processing allows GPUs to render quickly, because the total time to process a batch of pixels is only as long as the single most intensive pixel.
Adding outputs with differing coordinates greatly complicates this.
Suppose a single fragment was written to by two or more instances of a shader.
To ensure correct results, the GPU can either assign one to be an authority and ignore the other (how does it know which will write?)
Or you can add a mutex, and have one wait around for the other to finish.
The other option is to allow a race condition regarding whichever one finishes first.
Either way, this would immensely slow down the process, make the shaders ugly, and introduce incorrect and unpredictable behaviour.
Well, firstly, you can calculate multiple outputs from a single fragment shader in OpenGL 3 and up. A framebuffer object can have more than one RGBA surface (Renderbuffer Objects) attached, and the shader can generate an RGBA value for each of them by using gl_FragData[n] instead of gl_FragColor. See chapter 8 of the 5th edition OpenGL SuperBible.
However, the multiple outputs can only be generated for the same X,Y pixel coordinates in each buffer. This is for the same reason that an older style fragment shader can only generate one output, and can't change gl_FragCoord. OpenGL guarantees that in rendering any primitive, one and only one fragment shader will write to any X,Y pixel in the destination framebuffer(s).
If a fragment shader could generate multiple pixel values at different X,Y coords, it might try to write to the same destination pixel as another execution of the same fragment shader. Same if the fragment shader could change the pixel X or Y. This is the classic multiple threads trying to update shared memory problem.
One way to solve it would be to say "if this happens, the results are unpredictable" which sucks from the programmer point of view because it's completely out of your control. Or fragment shaders would have to lock the pixels they are updating, which would make GPUs far more complicated and expensive, and the performance would suck. Or fragment shaders would execute in some defined order (eg top left to bottom right) instead of in parallel, which wouldn't need locks but the performance would suck even more.

Web-GL : Multiple Fragment Shaders per Program

Does anyone know if it's possible to have multiple fragment shaders run serially in a single Web-GL "program"? I'm trying to replicate some code I have written in WPF using shader Effects. In the WPF program I would wrap an image with multiple borders and each border would have an Effect attached to it (allowing for multiple Effects to run serially on the same image).
I'm afraid you're probably going to have to clarify your question a bit, but I'll take a stab at answering anyway:
WebGL can support, effectively, as many different shaders as you want. (There are of course practical limits like available memory but you'd have to be trying pretty hard to bump into them by creating too many shaders.) In fact, most "real world" WebGL/OpenGL applications will use a combination of many different shaders to produce the final scene rendered to your screen. (A simple example: Water will usually be rendered with a different shader or set of shaders than the rest of the environment around it.)
When dispatching render commands only one shader program may be active at a time. The currently active program is specified by calling gl.useProgram(shaderProgram); after which any geometry drawn will be rendered with that program. If you want to render an effect that requires multiple different shaders you will need to group them by shader and draw each batch separately:
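gl.useProgram(shader1Program); // shader1Program: linked WebGLProgram for the first batch (name assumed)
// Setup shader1 uniforms, bind the appropriate buffers, etc.
gl.drawElements(gl.TRIANGLES, shader1VertexCount, gl.UNSIGNED_SHORT, 0); // Draw geometry that uses shader1
gl.useProgram(shader2Program); // shader2Program: linked WebGLProgram for the second batch (name assumed)
// Setup shader2 uniforms, bind the appropriate buffers, etc.
gl.drawElements(gl.TRIANGLES, shader2VertexCount, gl.UNSIGNED_SHORT, 0); // Draw geometry that uses shader2
// And so on...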
The other answers are on the right track. You'd either need to create a shader on the fly that applies all the effects in one shader, or use framebuffers and apply the effects one at a time. There's an example of the latter in "WebGL Image Processing Continued".
As Toji suggested, you might want to clarify your question. If I understand you correctly, you want to apply a set of post-processing effects to an image.
The simple answer to your question is: No, you can't use multiple fragment shaders with one vertex shader.
However, there are two ways to accomplish this: First, you can write everything in one fragment shader and combine them in the end. This depends on the effects you want to have!
Second, you can write multiple shader programs (one for each effect) and write your results to a framebuffer object (render to texture). Each shader would get the results of the previous effect and apply the next one. This would be a bit more complicated, but it is the most flexible approach.
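The second approach is often done by "ping-ponging" between two render targets. The sketch below uses only standard WebGL calls, but everything around them is assumed: `effects` is an array of already linked programs, `targets` holds two framebuffer/texture pairs created elsewhere, a full-screen quad of 6 vertices is already bound with its attributes set up, and each program samples from texture unit 0.

```javascript
// effects: array of linked WebGLProgram objects, one per effect (assumed).
// targets: two { framebuffer, texture } pairs, each texture already attached
// as the color attachment of its framebuffer (assumed to be set up elsewhere).
function applyEffects(gl, effects, sourceTexture, targets) {
  let input = sourceTexture;
  effects.forEach((program, i) => {
    const isLast = i === effects.length - 1;
    const target = targets[i % 2];
    // The last effect draws to the canvas; the others draw into a texture.
    gl.bindFramebuffer(gl.FRAMEBUFFER, isLast ? null : target.framebuffer);
    gl.useProgram(program);
    gl.bindTexture(gl.TEXTURE_2D, input);
    gl.drawArrays(gl.TRIANGLES, 0, 6); // full-screen quad as two triangles
    input = target.texture; // the next effect reads what this one wrote
  });
}
```

Alternating between the two targets means an effect never reads from the texture it is currently rendering into.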
If you mean to run several shaders in a single render pass, like so (example pulled from thin air):
Vertex color
...each stage attached to a single WebGLProgram object, and each stage with its own main() function, then no, GLSL doesn't work this way.
GLSL works more like C/C++, where you have a single global main() function that acts as your program's entry point, and any arbitrary number of libraries attached to it. The four examples above could each be a separate "library," compiled on its own but linked together into a single program, and invoked by a single main() function, but they may not each define their own main() function, because such definitions in GLSL are shared across the entire program.
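Since WebGL accepts only one fragment shader object per program, the "libraries plus one main()" idea is usually realised by concatenating source strings. The following is a minimal sketch of that, not the approach used by any particular framework: the function name, the `stages` structure, and the uniform/varying names `u_image` and `v_texCoord` are all invented for illustration.

```javascript
// Build one fragment shader source from several "library" functions,
// each of type vec4 -> vec4, plus a single generated main().
function buildFragmentSource(stages) {
  const functions = stages.map((s) => s.source).join("\n");
  const calls = stages.map((s) => `  color = ${s.name}(color);`).join("\n");
  return [
    "precision mediump float;",
    "uniform sampler2D u_image;",
    "varying vec2 v_texCoord;",
    functions,
    "void main() {",
    "  vec4 color = texture2D(u_image, v_texCoord);",
    calls,
    "  gl_FragColor = color;",
    "}",
  ].join("\n");
}

const fragmentSource = buildFragmentSource([
  {
    name: "grayscale",
    source:
      "vec4 grayscale(vec4 c) { float g = dot(c.rgb, vec3(0.299, 0.587, 0.114)); return vec4(vec3(g), c.a); }",
  },
  {
    name: "invert",
    source: "vec4 invert(vec4 c) { return vec4(1.0 - c.rgb, c.a); }",
  },
]);
console.log(fragmentSource);
```

The resulting string can be compiled as an ordinary fragment shader; the effects run in the order the stages are listed.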
This unfortunately requires you to write separate main() functions (at a minimum) for every shader you intend to use, which leads to a lot of redundant programming, even if you plan to reuse the libraries themselves. That's why I ended up writing a glorified string mangler to manage my GLSL libraries for Jax; I'm not sure how useful the code will be outside of my framework, but you are certainly free to take a look at it, and make use of anything you find helpful. The relevant files are:
spec/javascripts/jax/ (tests and usage examples)
spec/javascripts/jax/shader/ (more tests and usage examples)
Good luck!

Shader framebuffer readback

I was wondering if there is support in the newer shader models for reading back a pixel value from the target framebuffer. I assume that this is already done in later (non-programmable) stages of the drawing pipeline, which made me hope that this feature might have been added to the programmable pipeline.
I am aware that it is possible to draw to a texture bound framebuffer and then send this texture to the shader, I was just hoping for a more elegant way to achieve the same functionality.
As Andrew notes, the framebuffer access is logically a separate stage from the fragment shader, so reading the framebuffer in the fragment shader is impossible. The reason for this (to answer Andrew's question) is a combination of performance and the ordering requirements of the graphics pipeline. The way the rendering pipeline is defined, framebuffer blending operations MUST occur in the same order as the triangles/primitives that went into the beginning of the pipeline. The fragment shaders, on the other hand, can happen in any order. So by having them be separate stages, the GPU is free to run fragment shaders as fast as it can, as their inputs become available, without having to synchronize between them. As long as it maintains enough buffer space to hold on to the outputs of the fragment shaders, so that they can be accumulated and allow the framebuffer blends and writes to occur in order, all is well, as the results of any given fragment shader are not visible until after the blending stage.
If there was a way for the fragment shader to read the framebuffer, it would require some sort of synchronization to ensure that those reads happen in order, thus greatly slowing things down.
No. As you mention, rendering to a texture is the way to achieve that functionality.
If you take a look at a block diagram of a GPU pipeline, you'll see that the blending stage - which is what combines fragment shader output with the framebuffer - is separate from the fragment shader and is fixed-function.
I'm not a GPU designer - so I can only speculate the reason for this. Presumably it is to keep framebuffer access fast and insulate the fragment shader stage from the frame buffer so that it can be better parallelised. There are probably also issues regarding multi-sampling, and so on.
(Not to mention that fixed-function blending is "good enough" in most cases.)
Actually, I think this is now doable with Direct3D 11 SM 5.0 (I didn't test it though).
You can bind a UAV to a PS 5.0 to allow read and write operations on it, using the method OMSetRenderTargetsAndUnorderedAccessViews.
In that case the backbuffer of the swap chain in which you render has to be created with flag DXGI_USAGE_UNORDERED_ACCESS (I guess).
This is used in DXSDK OIT11 sample.
It is possible to read back the contents of the framebuffer in the fragment shader with the Shader_framebuffer_fetch extension. Support can be added to the GPU with some performance loss. In fact, these days I'm working on adding support for this extension to the OpenGL ES 2.0 driver of a well-known GPU brand in the consumer electronics market.
You can draw to a texture TEX (using a render target view) and then bind that as an input to another shader (using a shader resource view). TEX is then a pseudo-framebuffer.