I know this technically isn't supported (and as far as I can tell it's undefined behavior) but is it really a fatally horrible thing to sample from a texture which is also being written to?
I ask because I need to read from a depth texture which I also need to write to. If I can't do this, it means I will have to copy the depth texture first; and if that isn't a big deal, I don't see the harm in simply copying it.
Thanks for any help!
Yes, it's fatal and triggers undefined behaviour. Just make a copy and read from the copy.
The explanation is simple. Since fragments are processed in parallel and in an unspecified order, you might be reading texels that have already been overwritten or texels that still hold their original values, and there is no way of knowing which you will get. Making a copy and reading from it ensures that you read the correct values.
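If it helps, here is a minimal sketch of the copy approach, assuming GL 4.3+ (glCopyImageSubData) and that depthCopy was allocated with the same size and internal format as depthTex; all names here are placeholders, not anything from your code:

// Hypothetical helper: duplicate the depth texture so the copy can be sampled
// while the original keeps receiving depth writes through the FBO.
void copyDepthForSampling(GLuint depthTex, GLuint depthCopy, GLsizei width, GLsizei height)
{
    glCopyImageSubData(depthTex,  GL_TEXTURE_2D, 0, 0, 0, 0,
                       depthCopy, GL_TEXTURE_2D, 0, 0, 0, 0,
                       width, height, 1);
    // Sample 'depthCopy' in the shader; keep rendering depth into 'depthTex'.
}

On versions without ARB_copy_image, a glBlitFramebuffer of the depth attachment between two FBOs achieves the same thing.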
Matias and Goz covered the most important bits. Let me add a couple interesting facts:
The Direct3D runtime actively unbinds textures when you bind their underlying resource as a render-target (so you can't create the cycle there).
UAVs in Direct3D 11 actually allow read-modify-write operations on a subset of the formats (the ones that do not require a type conversion). They do not guarantee any order of operations, though. This is notably what a number of order-independent transparency algorithms rely on (the re-ordering is done manually there).
The example I was reading comes from the OpenGL Red Book.
Source code is here: https://github.com/openglredbook/examples/blob/master/src/11-oit/11-oit.cpp
I read that image load/store is an incoherent memory access and does not guarantee ordering between two rendering commands. https://www.khronos.org/opengl/wiki/Memory_Model
When I read the source code for this algorithm, I see no mention of a memory barrier.
So do I actually need to call a memory barrier between the rendering command that sorts and stores the fragments and the rendering command that renders the quad?
For your general question, yes, you need an explicit memory barrier between the two operations.
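To illustrate, a minimal sketch of where the barrier would go; the two draw calls below are placeholders, not the names used in that example:

// Pass 1: fragments are stored through image load/store (incoherent writes).
drawSceneStoringFragments();   // placeholder

// Make the image/buffer writes from pass 1 visible to the reads in pass 2.
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT);

// Pass 2: the full-screen quad reads, sorts and blends the stored fragments.
drawFullscreenQuad();          // placeholder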
On a more personal note, please stop looking at that code. I'm seeing many dubious things beyond just the lack of a barrier: the mapping of a buffer for the sole purpose of writing a single integer, a call to glTexSubImage2D that's sure to give an error because NULL is not a valid pointer parameter, etc.
Recently I looked into improving texture submissions for streaming and whatnot, and despite my long searches I have not found any material presenting, or even mentioning, any way of using PBOs with DSA-only functions.
Am I not looking in the right places or is there really no way as of yet?
All of the pixel transfer functions can take either a buffer object+offset or a client CPU pointer (unlike VAO functions, for example, which can only work with buffers now). As such, allowing you to pass a buffer object+offset directly would require having a separate entrypoint for each of the two ways of doing pixel transfer. So they would need glNamedReadPixelsToBuffer and glNamedReadPixelsToClient.
So instead of further proliferating the number of functions (and instead of forbidding using client memory), they make the buffer part work the way it always did: through a binding point. So yes, you're still going to have to bind that buffer to the PACK/UNPACK binding.
Since pixel transfers are not exactly a common operation (relative to the number of other kinds of state changing and rendering commands), and since these particular binds are not directly tied to the GPU, it shouldn't affect your code that much. Plus, there's already a lot of context state tied to pixel transfer operations; what does one more matter?
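To illustrate (the object names here are made up), a DSA-style upload from a PBO still selects the buffer through the classic unpack binding point:

// The DSA texture function is used, but the buffer source is still chosen
// through the GL_PIXEL_UNPACK_BUFFER binding point.
void uploadFromPBO(GLuint texture, GLuint pbo, GLsizei width, GLsizei height)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glTextureSubImage2D(texture, 0, 0, 0, width, height,
                        GL_RGBA, GL_UNSIGNED_BYTE,
                        (const void*)0);   // offset into the bound PBO, not a client pointer
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}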
A typical OpenGL call might look like the following:
GLuint buffer;
glGenBuffers(1, &buffer);
glBindBuffer(GL_SOME_BUFFER, buffer);
...
I've read that binding of buffers and other similar functions can be quite expensive. Is it worth saving the currently bound buffer, and checking it before I bind? Such as this:
void StateManager::bindBuffer(GLenum bufType, GLuint bufID) {
    if (this->m_currentBuffer[bufType] != bufID) {
        glBindBuffer(bufType, bufID);
        this->m_currentBuffer[bufType] = bufID;
    }
}
The idea behind this being that if bufID is already bound then the expensive call to glBindBuffer is missed. Is this a worthwhile approach? I assumed that OpenGL would likely implement such an optimization already, but I have seen this pattern used in a few projects now, so I am having my doubts. I am simply interested because it would be a pretty simple thing to implement, but if it doesn't make much/any difference then I will skip it (avoiding premature optimization).
This is highly platform and vendor dependent.
You're asking if "OpenGL would implement...". As you certainly understand already, OpenGL is an API specification. There are many different implementations, and whether they check for redundant state changes is entirely an implementation decision, which can (and will) be different from implementation to implementation.
You shouldn't even expect that a given implementation handles this the same for all pieces of state.
Since this topic is somewhat close to my heart based on past experience, I was tempted to write a small essay, including a few rants. But I decided that it wouldn't belong here, so here is just a list of considerations that could affect if a given OpenGL implementation tests for redundant state changes in specific cases:
How expensive is it to actually change the state? If it's very cheap, checking for redundant changes might simply not be worth it.
How expensive is it to check for redundant changes? Normally not much, but we're looking at pieces of software where every little bit counts.
Are important apps/benchmarks redundantly changing this state on a frequent basis?
What's the philosophy on responsibilities of apps vs. responsibilities of OpenGL implementations?
And yes, this is unfortunate for everybody. For you as an app writer who wants to get ideal performance across vendors/platforms, there's really no easy solution. If you add checks to your code, they will be useless, and add extra overhead, on platforms that have the same checks in the OpenGL implementation. If you do not have checks in your code, and cannot easily avoid having these redundant state changes in the first place, you may leave performance on the table on platforms where the OpenGL implementation does not check.
The reason why state caching is a bad idea is simple: you're doing it wrong. You'll always be in danger of doing it wrong.
Oh sure, you corrected the mistake I pointed out, that different buffer bindings have different state. And maybe you're using a hash-table that makes lookup pretty quick, even if a new extension comes out that adds a new buffer binding point that didn't exist when you wrote your cache.
But that's merely the tip of the iceberg as far as object binding idiosyncrasies.
For example, did you realize that GL_ELEMENT_ARRAY_BUFFER is not actually context state? It's really VAO state, and every time you bind a new VAO, that buffer binding changes. So your VAO cache now has to change the shadowed element buffer binding too.
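You can see this with a quick query around a VAO switch (the object names are hypothetical):

glBindVertexArray(vaoA);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBufferA);   // stored in vaoA, not in the context

glBindVertexArray(vaoB);
GLint bound = 0;
glGetIntegerv(GL_ELEMENT_ARRAY_BUFFER_BINDING, &bound);
// 'bound' now reports whatever vaoB has attached, not indexBufferA, so a
// context-level shadow copy of this binding is already out of date.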
Also, were you aware that deleting an object automatically unbinds it from any context binding points it is currently bound to? And this is true even for objects that are attached to another object that is bound to the context; the deleted object is automatically detached.
Except that this is only true for certain object types. And even then, it's only true for the context that was current when the object was deleted. Other contexts will be unaffected.
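For buffer objects, for example, the effect is easy to demonstrate with a small sketch:

GLuint buf = 0;
glGenBuffers(1, &buf);
glBindBuffer(GL_ARRAY_BUFFER, buf);
glDeleteBuffers(1, &buf);      // deleting the bound buffer resets this context's binding

GLint bound = -1;
glGetIntegerv(GL_ARRAY_BUFFER_BINDING, &bound);
// 'bound' is now 0, but a shadow cache that still remembers 'buf' would
// silently skip the next glBindBuffer call for that id.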
My point is this: proper caching of state is really hard. And if you get it wrong, you will create a multitude of very subtle bugs in your application. Whereas if you just let OpenGL do its thing and structure your code so that multiple binding simply doesn't happen, then you don't have a problem.
Does utilizing textureOffset(...) increase performance compared to calculating offsets manually and using regular texture(...) function?
As there is a GL_MAX_PROGRAM_TEXEL_OFFSET property, I would guess that it can fetch offset texels in a single fetch, or at least in as few fetches as possible, making it superb for blurring effects for example, but I can't seem to find out how it works internally anywhere.
Update:
Reformulating the question: is it common among GL drivers to make any optimizations regarding texture fetches when utilizing the textureOffset(...) function?
You're asking the wrong question. The question should not be whether the more specific function will always have better performance. The question is whether the more specific function will ever be slower.
And there's no reason to expect it to be slower. If the hardware has no specialized functionality for offset texture accesses, then the compiler will just offset the texture coordinate manually, exactly like you could. If there is hardware to help, then it will use it.
So if you have need of textureOffsets and can live within its limitations, there's no reason not to use it.
I would guess that it can fetch offset texels in a single fetch, or at least in as few fetches as possible, making it superb for blurring effects for example
No, that's textureGather. textureOffset is for doing exactly what its name says: accessing a texture based on a texture coordinate, with a texel offset from that coordinate's location.
textureGather samples from multiple neighboring texels all at once. If you need to read a section of a texture to do blurring, textureGather (and textureGatherOffset) are going to be more useful than textureOffset.
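And if you do stay with textureOffset, the limitation mentioned above is queryable from the API side, e.g.:

// textureOffset only accepts constant offsets within these implementation-defined
// limits; the spec guarantees at least the range [-8, 7].
GLint minOffset = 0, maxOffset = 0;
glGetIntegerv(GL_MIN_PROGRAM_TEXEL_OFFSET, &minOffset);
glGetIntegerv(GL_MAX_PROGRAM_TEXEL_OFFSET, &maxOffset);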
In OpenGL/ES you have to be careful to not cause a feedback loop (reading pixels from the same texture you are writing to) when implementing render to texture functionality. For obvious reasons the behavior is undefined when you are reading and writing to the same pixels of a texture. However, is it also undefined behavior if you are reading and writing to different pixels of the same texture? An example would be if I was trying to make a texture atlas with a render texture inside. While I am rendering to the texture, I read the pixels from another texture stored in the texture atlas.
As I am not reading and writing the same pixels in the texture, is the behavior still considered undefined just because the data is coming from the same texture?
However, is it also undefined behavior if you are reading and writing to different pixels of the same texture?
Yes.
Caching is the big problem here. When you write pixel data, it is not necessarily written to the image immediately. The write is stored in a cache, so that multiple pixels can be written all at once.
Texture accesses do the same thing. The problem is that they don't have the same cache. So you can have written some data that is in the write cache, but the texture cache doesn't know about it.
Now, the specification is a bit heavy-handed here. In theory, you could read from one area of a texture and write to another, so long as you never read from any location you have written to (and vice versa); the spec nevertheless leaves even that undefined. Obviously, that's not very helpful.
The NV_texture_barrier extension allows you to get around this. Despite being an NVIDIA extension, it is supported on ATI hardware too. The way it works is that you call the glTextureBarrierNV function when you want to flush all of the caches. That way, you can be sure that when you read from a location, any writes you made to it before the barrier will be visible.
So the idea is that you designate one area of the texture as the write area, and another as the read area. After you have rendered some stuff, and you need to do readback, you fire off a barrier and swap texture areas. It's like texture ping-ponging, but without the heavy operation of attaching a new texture or binding an FBO, or changing the drawbuffers.
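A rough sketch of that scheme (the draw calls and the region bookkeeping are placeholders, not a complete implementation):

// Pass 1: write into region A of the texture (restricted via viewport/scissor)
// while the shader samples from region B.
drawIntoRegionA();        // placeholder

// Flush the texture and framebuffer caches so the writes to A become visible.
glTextureBarrierNV();

// Pass 2: swap roles; write into region B while sampling the just-written region A.
drawIntoRegionB();        // placeholder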
The problem is not so much the possibility of feedback loops (technically this would not result in a loop, but in an undefined order in which pixels are read and written, causing undefinable behaviour), but the limitations of the access modes GPUs implement: a buffer can only be either read from or written to at any given time (gather vs. scatter access). And the GPU always sees a buffer as a whole. This is the main reason for the limitation.
Yes, it still is. GPUs are massively parallel, so you can't really say that you write 'one' pixel at a time, and there are also cache systems that are populated when you read a texture. If you write to the same texture, the caches would need to be synchronized, and so on.
For some insight, you can take a look at the NV_texture_barrier OpenGL extension, which is meant to add some flexibility in this area.
Yes, it's also undefined to read/write different areas of the texture. But why care whether it's undefined or not? Just write into another texture and avoid the problem altogether!