I'm doing some work with compute shaders, and I've noticed that if two invocations write to the same location in a texture using imageStore, you get a flickering effect when the texture is rendered: the order of the writes is not guaranteed, so sometimes one invocation gets there last and sometimes it's the other one. I would like my final colour value to be, say, the value with the highest red component. Is there a way for me to determine that within the shader?
I think there was some confusion, so I'll just give some more info. I'm working with data that I've bound on the CPU as GL_UNSIGNED_BYTE, and I access it using
layout (r8, binding = 0) uniform image3D visualTexture;
At this stage, I simply want to stop the flickering, i.e., have some shader invocation take preference over the others. The highest value would be ideal, but I want this to be fast.
Image atomic operations are only permitted on single-channel, 32-bit formats (r32i and r32ui; r32f supports only atomic exchange). So just change your data to use 32-bit unsigned integers rather than 8-bit, and use imageAtomicMax to write values into the image.
You could just use the 32-bit integer buffer as an intermediary, with a post-process that reads the 32-bit data and writes out to an 8-bit buffer.
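For illustration, a minimal compute shader sketch of that approach, assuming the texture is re-created as GL_R32UI and bound to image unit 0; the name visualTexture, the voxel coordinate and the red value are placeholders for however your invocations produce them:

#version 430
layout (local_size_x = 8, local_size_y = 8) in;

// r32ui rather than r8: image atomics require single-channel 32-bit formats
layout (r32ui, binding = 0) uniform uimage3D visualTexture;

void main()
{
    ivec3 voxel = ivec3(gl_GlobalInvocationID);    // placeholder target texel
    uint  red   = gl_GlobalInvocationID.x & 0xFFu; // placeholder 0-255 value
    // Race-free: the texel ends up holding the highest value written to it.
    imageAtomicMax(visualTexture, voxel, red);
}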
I want to count the number of fragments on each pixel (with depth test disabled). I have tried enabling blending and setting glBlendFunc(GL_ONE,GL_ONE) to accumulate them. This works just fine with a float32 format texture bound to an FBO, but I think a uint32 format texture (e.g. GL_R32UI) is more intuitive for this task. However, I can't get the expected behavior. It seems each fragment just overwrites the texture. I just wonder if there are other methods to do the accumulation on integer format textures.
However, I can't get the expected behavior. It seems each fragment just overwrites the texture.
That's because the blending stage is not available on pure integer framebuffer formats.
but I think a uint32 format texture (e.g. GL_R32UI) is more intuitive for this task.
Well, is it? What does "intuitive" even mean here? First of all, a GL_R16F format is probably enough for a reasonable amount of overdraw, and it would reduce bandwidth demands a lot (which seems to be the limiting factor for such a pass).
I just wonder if there's other methods to do the accumulation on integer format textures.
I can see two ways. I doubt that either of them is really more "intuitive", but if you absolutely need the result as an integer, you could try these:
Don't use a framebuffer at all, but use image load/store on an unsigned integer texture in the fragment shader. Use atomic operations, in particular imageAtomicAdd, to count the number of fragments at each fragment location (a sketch follows after these two options). Note that if you go that route, you're outside of the GL's automatic synchronization paths, and you'll have to add an explicit glMemoryBarrier call after that render pass.
You could also just use a standard normalized integer format like GL_R8 (or GL_R16), use blending as before, but have the fragment shader output 1.0/255.0 (or 1.0/65535.0, respectively). The data which ends up in the framebuffer will be integer in the end. If you need this data on the CPU, you can directly read it back; if you need it on the GPU, you can use glTextureView to reinterpret the data as an unnormalized integer texture without a copy/conversion step.
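Here's a minimal sketch of the first option, assuming a GL_R32UI texture bound to image unit 0 with glBindImageTexture (the name counterImage is a placeholder):

#version 420
layout (r32ui, binding = 0) uniform uimage2D counterImage;

void main()
{
    // Atomically add 1 to this pixel's counter; safe under concurrent fragments.
    imageAtomicAdd(counterImage, ivec2(gl_FragCoord.xy), 1u);
}

After the render pass, make the writes visible to subsequent reads with glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT) (or whichever barrier bit matches how you consume the data).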
Shaders have invocations, each of which is (usually) given a unique set of input data and (usually) writes to its own separate output data. When you issue a rendering command, how many times does each shader get invoked?
Each shader stage has its own frequency of invocations. I will use the OpenGL terminology, but D3D works the same way (since they're both modelling the same hardware relationships).
Vertex Shaders
These are the second most complicated. They execute once for every input vertex... kinda. If you are using non-indexed rendering, then the ratio is exactly 1:1. Every input vertex will execute on a separate vertex shader instance.
If you are using indexed rendering, then it gets complicated. It's more-or-less 1:1, each vertex having its own VS invocation. However, thanks to post-T&L caching, it is possible for a vertex shader to be executed less than once per input vertex.
See, a vertex shader's execution is assumed to create a 1:1 mapping between input vertex data and output vertex data. This means if you pass identical input data to a vertex shader (in the same rendering command), your VS is expected to generate identical output data. So if the hardware can detect that it is about to execute a vertex shader on the same input data that it has used previously, it can skip that execution and simply use the outputs from the previous execution. Assuming it has those values lying around, such as in a cache.
Hardware detects this by using the vertex's index (which is why it doesn't work for non-indexed rendering). If the same index is provided to a vertex shader, it is assumed that the shader will get all of the same input values, and therefore will generate the same output values. So the hardware will cache output values based on indices. If an index is in the post-T&L cache, then the hardware will skip the VS's execution and just use the output values.
Instancing only slightly complicates post-T&L caching. Rather than caching solely on the vertex index, it caches based on the index and instance ID. So it only uses the cached data if both values are the same.
So generally, VS's execute once for every vertex, but if you optimize your geometry with indexed data, it can execute fewer times. Sometimes much fewer, depending on how you do it.
Tessellation Control Shaders
Or Hull Shaders in D3D parlance.
The TCS is very simple in this regard. It will execute exactly once for each vertex in each patch of the rendering command. No caching or other optimizations are done here.
Tessellation Evaluation Shaders
Or Domain Shaders in D3D parlance.
The TES executes after the tessellation primitive generator has generated new vertices. Because of that, how frequently it executes will obviously depend on your tessellation parameters.
The TES takes vertices generated by the tessellator and outputs vertices. It does so in a 1:1 ratio.
But similar to Vertex Shaders, it is not necessarily 1:1 for each vertex in each of the output primitives. Like a VS, the TES is assumed to provide a direct 1:1 mapping between locations in the tessellated primitives and output parameters. So if you invoke a TES multiple times with the same patch location, it is expected to output the same value.
As such, if generated primitives share vertices, the TES will often only be invoked once for such shared vertices. Unlike vertex shaders, you have no control over how much the hardware will utilize this. The best you can do is hope that the generation algorithm is smart enough to minimize how often it calls the TES.
Geometry Shaders
A Geometry Shader will be invoked once for each point, line or triangle primitive, either directly given by the rendering command or generated by the tessellator. So if you render 6 vertices as unconnected lines, your GS will be invoked exactly 3 times.
Each GS invocation can generate zero or more primitives as output.
The GS can use instancing internally (in OpenGL 4.0 or Direct3D 11). This means that, for each primitive that reaches the GS, the GS will be invoked X times, where X is the number of GS instances. Each such invocation will get the same input primitive data (with a special input value used to distinguish between such instances). This is useful for more efficiently directing primitives to different layers of layered framebuffers.
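To make that concrete, here's a minimal GLSL sketch of GS instancing; the layer routing is just one hypothetical use:

#version 400
// Run this shader 4 times per input triangle; gl_InvocationID tells them apart.
layout (triangles, invocations = 4) in;
layout (triangle_strip, max_vertices = 3) out;

void main()
{
    for (int i = 0; i < 3; ++i)
    {
        gl_Position = gl_in[i].gl_Position;
        gl_Layer    = gl_InvocationID; // send each copy to a different layer
        EmitVertex();
    }
    EndPrimitive();
}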
Fragment Shaders
Or Pixel Shaders in D3D parlance. Even though they aren't pixels yet, may not become pixels, and they can be executed multiple times for the same pixel ;)
These are the most complicated with regard to invocation frequency. How often they execute depends on a lot of things.
FS's must be executed at least once for each pixel-sized area that a primitive rasterizes to. But they may be executed more than that.
In order to compute derivatives for texture functions, one FS invocation will often borrow values from one of its neighboring invocations. This is problematic if there is no such invocation, e.g. when a neighbor falls outside of the boundary of the primitive being rasterized.
In such cases, there will still be a neighboring FS invocation. Even though it produces no actual data, it still exists and still does work. The good part is that these helper invocations don't hurt performance; they're basically using up shader resources that would have otherwise gone unused. Also, any attempt by such helper invocations to actually output data will be ignored by the system.
But they do still technically exist.
A less transparent issue revolves around multisampling. See, multisampling implementations (particularly in OpenGL) are allowed to decide on their own how many FS invocations to issue. While there are ways to force multisampled rendering to create an FS invocation for every sample, there is no guarantee that implementations will execute the FS only once per covered pixel outside of these cases.
For example, if I recall correctly, if you create a multisample image with a high sample count on certain NVIDIA hardware (8 to 16 or something like that), then the hardware may decide to execute the FS multiple times. Not necessarily once per sample, but once for every 4 samples or so.
So how many FS invocations do you get? At least one for every pixel-sized area covered by the primitive being rasterized. Possibly more if you're doing multisampled rendering.
Compute Shaders
The exact number of invocations that you specify. That is, the number of work groups you dispatch * the number of invocations per group specified by your CS (your local group size). No more, no less.
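For example (a sketch; the specific sizes are arbitrary):

#version 430
// Each work group contains 8 * 8 * 1 = 64 invocations.
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

void main()
{
    // A glDispatchCompute(16, 16, 1) call launches 16 * 16 * 1 = 256 groups,
    // so exactly 256 * 64 = 16384 invocations of this shader. No more, no less.
}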
I'm writing a program that uses the GPU to calculate stuff, and I want to read data from the framebuffers to be used in my client code. The framebuffers I'm using are about 40 textures, all 1024x1024 in size, all of which contain data that needs to be read, but only very sparsely: like 50 or so pixels in arbitrary x/y coordinates from each texture. Using glReadPixels for each texture, for each frame, is proving too costly for me to do though...
I only need to read a few select pixels from each texture, is there a way to quickly gather their data without needing to download every entire texture from the GPU?
This sounds fairly expensive no matter how you slice it. A couple of approaches come to mind:
What I would try first is glReadPixels(), but using a PBO. Bind a buffer large enough to hold all the pixels to the GL_PIXEL_PACK_BUFFER target, and then submit the glReadPixels() calls, with offsets to place the results in distinct sections of the buffer. Then call glMapBufferRange() to read back the values.
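A minimal C sketch of that idea; the FBO handles, coordinates, and counts (numReads, fbo, x, y) are placeholders, and each read is assumed to be a single RGBA8 pixel:

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, numReads * 4, NULL, GL_STREAM_READ);

for (int i = 0; i < numReads; ++i) {
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo[i / 50]); /* hypothetical FBO list */
    /* with a PBO bound, the last argument is a byte offset, not a pointer */
    glReadPixels(x[i], y[i], 1, 1, GL_RGBA, GL_UNSIGNED_BYTE,
                 (void*)(intptr_t)(i * 4));
}

const GLubyte* data = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                       numReads * 4, GL_MAP_READ_BIT);
/* ... consume the 50-or-so pixels per texture ... */
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);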
An alternate approach is to copy all the pixels you want to read into a single texture. You could use glBlitFramebuffer() or glCopyTexSubImage2D(). Then use a single glReadPixels() or glGetTexImage() call to get all the data from this texture.
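A sketch of that gather-then-read variant, assuming a small gatherTex sized to hold one row of 50 pixels per source texture (all names and bounds are placeholders):

glBindTexture(GL_TEXTURE_2D, gatherTex); /* e.g. 50 x 40 RGBA8 texels */
for (int t = 0; t < numTextures; ++t) {
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo[t]);
    for (int i = 0; i < 50; ++i)
        glCopyTexSubImage2D(GL_TEXTURE_2D, 0, i, t, x[t][i], y[t][i], 1, 1);
}
/* one readback instead of ~2000 tiny ones */
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, results);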
Both of these approaches should result in about the same amount of work and synchronization overhead. But one or the other could be more efficient, depending on which paths in the driver are better optimized.
As the earlier answer already suggested, I would make very sure that you really need this, and there isn't any way to keep and process the data on the GPU. Any time you read back data, you introduce synchronization between GPU and CPU, which is mostly harmful to performance.
Do you have any restrictions on what OpenGL version you can use? If not, it sounds like you should look into compute shaders. You say that you are calculating data, so I assume that you are "abusing" the rendering pipeline for your application, especially the fragment shader, and store fragment data in the framebuffer that is interpreted as something else than color.
If this is the case, then all you need is a shader storage buffer and an atomic counter. At some point right now you are deciding that fragment (x, y, z) [z being the texture index] should have value v. So in your compute shader, you do your calculation as you would in the fragment shader, but as output, you store a tuple (x, y, z, v). You store this tuple in the shader storage buffer at the index given by the atomic counter, which you increment with each written element. In the end, you have your data stored compactly in the buffer and only need to read back these elements; the exact number is the value the atomic counter holds after termination. Download the buffer with glGetBufferSubData into an array of location-value pairs, iterate over it and do your CPU magic.
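A compute shader sketch of that compaction; the struct layout, binding points, and the placeholder computation are assumptions to be replaced with your own:

#version 430
layout (local_size_x = 64) in;

layout (binding = 0) uniform atomic_uint writeCount;

struct Result { int x; int y; int z; float v; };
layout (std430, binding = 1) buffer Results { Result results[]; };

void main()
{
    // ... your former fragment-shader computation decides on (x, y, z, v) ...
    int   x = int(gl_GlobalInvocationID.x), y = 0, z = 0; // placeholders
    float v = 1.0;

    uint idx = atomicCounterIncrement(writeCount); // unique slot per element
    results[idx] = Result(x, y, z, v);
}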
If you need to copy the data from the GPU to the CPU memory, there is no way (AFAIK) around using glReadPixels.
Depending on what platform you're using, and the specific of your programs, you can try several optimizations, using FBOs:
Copy only part of the texture, assuming you know the locations of the pixels. Note that in most cases it is still faster to copy the entire texture instead of issuing several small reads.
If you don't need 32 bit textures, you can render to a lower color resolution. The specific depends on your platform extensions.
Maybe you don't really need to copy the pixels since you plan to use them as a texture input to the next stage? In that case you can copy the pixels directly on the GPU using glCopyTexImage2D.
I have a compute shader that is dispatched iteratively and uses a 2d texture to temporarily store values. Each invocation id accesses a particular row in the texture.
The problem is, this texture must be initialized to 0's before each shader dispatch.
Currently I use a loop at the end of the shader code that uses imageStore() to reset all pixels in the respective row back to 0.
for (uint i = 0; i < CONSTANT_SIZE; i++)
{
    imageStore( myTexture, ivec2( i, global_invocation_id ), vec4( 0, 0, 0, 0 ) );
}
I was wondering if there is a faster way of doing this, a way to set more than one pixel with a single call (preferably an entire row)? I've looked at the GLSL 4.3 specification on image operations but I can't find one that doesn't require a specific pixel location.
If there is a faster way to achieve this on the CPU I would be open to that as well. I've tried re-uploading the texture using glTexImage2D(), but there is no noticeable performance change compared to using imageStore for each individual pixel.
The "faster way" would be to clear the texture from OpenGL, rather than in your shader. 4.4 provides a direct texture clearing function, but even something as simple as a pixel transfer via glTexSubImage2D (after a barrier of course) would probably be faster than what you're doing.
Alternatively, if all you're using this texture for is scratch memory for invocations... why are you using a texture? It'd be better to use shared variables for that. Just create an array of arrays of vec4s, where each local invocation accesses one of the arrays. Accesses to those are going to be loads faster.
Given 32KB of storage for shared variables (the bare minimum allowed), if you have 8 invocations per work group, that gives each one 4KB to work with. That gives each one 256 vec4s to play with. If you move up to 16 invocations, you reduce this to 128 vec4s.
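A sketch of that layout, assuming 8 invocations per work group (the sizes are placeholders chosen to fit the 32KB minimum):

#version 430
layout (local_size_x = 8) in;

// 8 rows * 256 vec4s * 16 bytes = 32 KB, the minimum guaranteed shared storage.
shared vec4 scratch[8][256];

void main()
{
    uint row = gl_LocalInvocationID.x; // each invocation owns one row

    for (uint i = 0u; i < 256u; ++i)
        scratch[row][i] = vec4(0.0);   // "clearing" becomes a cheap local write

    // ... do the iterative work against scratch[row][...] instead of the image ...
}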
I'm using my alpha channel as an 8 bit integer index for something unrelated to blending so I want to carefully control the bit values. In particular, I need for all of the pixels from one FBO-rendered texture with a particular alpha value to match all of the pixels with the same alpha value in the shader. Experience has taught me to be careful when comparing floating point values for equality...
Setting the color values using a floating-point vec4 might not cause me issues; my understanding is that even a half-precision 16-bit float can differentiate all 8-bit integer (0-255) values. But I would prefer to perform integer operations in the fragment shader so I am certain of the values.
Am I likely to incur a performance hit by performing integer ops in the fragment shader?
How is the output scaled? I read somewhere that it is valid to send integer vectors as color output for a fragment, but how is it scaled? If I send a uvec4 with integers 0-255, will it scale that appropriately? I'd like for it to directly write the integer value into the pixel format; for integer formats I don't want it to do any scaling. Perhaps for RGBA8, sending in an int value above 255 would clamp it to 255, clamp negative ints to zero, and so on.
This issue is made difficult by the fact that I cannot debug by printing out the color values unless I grab the rendered images and examine them carefully. Perhaps I can draw a bright color if something fails to match.
Here is a relevant thread I found on this topic. It has confused me even more than before.
I suggest not using the color attachment's alpha channel, but an additional render target with an explicit integer format. This has been available since at least OpenGL 3.1 (the oldest spec I looked at for this answer). See the OpenGL function glBindFragDataLocation, which binds a fragment shader out variable; in your case an int out $VARIABLENAME. For input into the next stage, use an integer sampler. I refer you to the specifications of OpenGL 3.1 and GLSL 1.30 for the details.
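A minimal sketch of the fragment side, assuming a second color attachment with a GL_R8UI internal format and the out variable bound to location 1 (via glBindFragDataLocation or, as here, a layout qualifier):

#version 330
layout (location = 0) out vec4 color; // ordinary normalized attachment
layout (location = 1) out uint index; // integer attachment: no normalization, no blending

void main()
{
    color = vec4(1.0);
    index = 42u; // stored as an exact integer; keep it within the format's 0-255 range
}

In the pass that consumes it, declare a matching integer sampler, e.g. uniform usampler2D indexTex; and read uint i = texture(indexTex, uv).r; so the comparison is exact integer equality, with no floating-point concerns.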