In OpenGL/ES you have to be careful not to cause a feedback loop (reading pixels from the same texture you are writing to) when implementing render-to-texture functionality. For obvious reasons, the behavior is undefined when you are reading and writing to the same pixels of a texture. However, is it also undefined behavior if you are reading and writing to different pixels of the same texture? An example would be if I were trying to make a texture atlas with a render texture inside. While rendering to that texture, I read the pixels of another texture stored in the atlas.
As I am not reading and writing the same pixels of the texture, is the behavior still considered undefined, just because the data comes from the same texture?
However, is it also undefined behavior if you are reading and writing to different pixels of the same texture?
Yes.
Caching is the big problem here. When you write pixel data, it is not necessarily written to the image immediately. The write is stored in a cache, so that multiple pixels can be written all at once.
Texture accesses do the same thing. The problem is that they don't have the same cache. So you can have written some data that is in the write cache, but the texture cache doesn't know about it.
Now, the specification is a bit heavy-handed here. In theory, you could read from one area of a texture and write to another, so long as you never read from any location you have written to, and vice versa; but the spec leaves even that undefined. Obviously, that's not very helpful.
The NV_texture_barrier extension allows you to get around this. Despite being an NVIDIA extension, it is supported on ATI hardware too. The way it works is that you call the glTextureBarrierNV function when you want to flush all of the caches. That way, you can be sure that when you read from a location, any writes you previously made to it are visible.
So the idea is that you designate one area of the texture as the write area, and another as the read area. After you have rendered some stuff, and you need to do readback, you fire off a barrier and swap texture areas. It's like texture ping-ponging, but without the heavy operation of attaching a new texture or binding an FBO, or changing the drawbuffers.
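A minimal sketch of that scheme, assuming the extension is available; drawReadingFrom() and swapReadWriteRegions() are hypothetical helpers standing in for your own draw and bookkeeping code:

```c
/* Sketch: ping-pong between two regions of one texture, assuming
 * NV_texture_barrier is available. The texture is both attached to
 * the bound FBO and bound for sampling. */
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glBindTexture(GL_TEXTURE_2D, tex);

for (int pass = 0; pass < numPasses; ++pass) {
    /* Hypothetical helper: render into writeRegion while sampling
     * only from readRegion of the same texture. */
    drawReadingFrom(&readRegion, &writeRegion);

    /* Flush the texture caches so later reads see these writes. */
    glTextureBarrierNV();

    /* Hypothetical helper: the freshly written region becomes the
     * read region for the next pass. */
    swapReadWriteRegions(&readRegion, &writeRegion);
}
```

Note that no FBO or texture rebinding happens inside the loop; the barrier call is the only per-pass synchronization cost.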
The problem is not so much the possibility of feedback loops (technically this would not result in a loop, but in an undefined order in which pixels are read and written, which makes the result undefined), but the limitations of the access modes GPUs implement: a buffer can only be read from or only be written to at any given time (gather vs. scatter access). And the GPU always sees a buffer as a whole. This is the main reason for that limitation.
Yes, it still is. GPUs are massively parallel, so you can't really say that you write 'one' pixel at a time; there are also cache systems that are populated when you read a texture. If you write to the same texture, those caches would need to be synchronized, and so on.
For some insight, you can take a look at the NV_texture_barrier OpenGL extension, which is meant to add some flexibility in this area.
Yes, it's also undefined to read/write different areas of the texture. But why care if it's undefined or not, just write into another texture and avoid the problem altogether!
Recently I looked into improving texture submissions for streaming and whatnot, and despite my long searches I have not found any material presenting, or even mentioning, a way of using PBOs with DSA-only functions.
Am I not looking in the right places or is there really no way as of yet?
All of the pixel transfer functions can take either a buffer object+offset or a client CPU pointer (unlike VAO functions, for example, which can only work with buffers now). As such, allowing you to pass a buffer object+offset directly would require having a separate entrypoint for each of the two ways of doing pixel transfer. So they would need glNamedReadPixelsToBuffer and glNamedReadPixelsToClient.
So instead of further proliferating the number of functions (and instead of forbidding using client memory), they make the buffer part work the way it always did: through a binding point. So yes, you're still going to have to bind that buffer to the PACK/UNPACK binding.
Since pixel transfers are not exactly a common operation (relative to the number of other kinds of state changing and rendering commands), and since these particular binds are not directly tied to the GPU, it shouldn't affect your code that much. Plus, there's already a lot of context state tied to pixel transfer operations; what does one more matter?
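For illustration, a small sketch of what that looks like in practice: the buffer itself can be created with DSA calls, but the readback still goes through the classic PACK binding (buffer name and dimensions here are assumptions):

```c
/* Create and size the PBO with DSA functions (GL 4.5 / ARB_direct_state_access). */
GLuint pbo;
glCreateBuffers(1, &pbo);
glNamedBufferStorage(pbo, width * height * 4, NULL, GL_MAP_READ_BIT);

/* The pixel transfer itself still uses the PACK binding point. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE,
             (void*)0); /* "pointer" is an offset into the bound PBO */
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```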
I have seen different opinions.
For now, I am only concerned about color data.
In Chapter 28. Graphics Pipeline Performance, it says:
Avoid extraneous color-buffer clears. If every pixel is guaranteed to be overwritten in the frame buffer by your application, then avoid clearing color, because it costs precious bandwidth.
In How does glClear() improve performance?, it quotes from Apple's Technical Q&A on addressing flickering (QA1650):
You must provide a color to every pixel on the screen. At the beginning of your drawing code, it is a good idea to use glClear() to initialize the color buffer. A full-screen clear of each of your color, depth, and stencil buffers (if you're using them) at the start of a frame can also generally improve your application's performance.
And one answer in that post:
By issuing a glClear command, you are telling the hardware that you do not need previous buffer content, thus it does not need to copy the color/depth/whatever from the framebuffer to the smaller tile memory.
To that answer, my question is:
If there is no blending, why do we need to read color data from the framebuffer?
(For now, I am only concerned about color data.)
But anyway, in general, do I need to call glClear(GL_COLOR_BUFFER_BIT)?
In Chapter 28. Graphics Pipeline Performance, it says:
There are a lot of different kinds of hardware. On hardware that was prevalent when GPU Gems #1 was printed, this advice was sound. Nowadays it no longer is.
Once upon a time, clearing buffers actually meant that the hardware would go to each pixel and write the clear value. This process obviously took a non-trivial amount of GPU time, so high-performance application developers did their best to avoid incurring the wrath of the clear operation.
Nowadays (by which I mean pretty much any GPU made in the last 8-10 years, at least), graphics chips are smarter about clears. Instead of doing a clear, they play games with the framebuffer's caches.
The value a framebuffer image is cleared to matters when doing read/modify/write operations. This includes blending and such, but it also includes any form of depth or stencil testing. In order to do an RMW operation, you must first read the value that's there.
This is where the cleverness comes in. When you "clear" a framebuffer image, nothing gets written. Instead, the framebuffer image's address space is invalidated. When a read operation happens to an invalidated address, it simply returns the clear value. This costs zero bandwidth. Indeed, it saves bandwidth, because the read operation doesn't actually have to read memory. It just fetches a clear value.
Depending on how the cache works, this may even be faster when doing pure write operations. But that rather depends on the hardware.
For mobile hardware that uses tile-based rendering, this matters even more. Before a tile can begin processing, it has to read the current values of the framebuffer images. If the images are cleared, it doesn't need to read anything; it simply sets the tile memory to the clear color.
This case matters a lot even if you're not blending to the framebuffer. Why? Because neither the GPU nor the API knows that you won't be blending. It only knows that you're going to perform some number of rendering operations to that image. So it must assume the worst and read the image into the tiles. Unless you cleared it beforehand, of course.
In short, when using those images for framebuffers, clearing the images first is generally no slower than not clearing the images.
The above all assumes that you clear the entire image. If you're only clearing a sub-region of the image, then such optimizations are less likely to happen. Though it may still be possible, at least for the optimizations that are based on cache behavior.
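In practice that boils down to something as simple as this sketch, clearing every attached buffer once at the top of the frame:

```c
/* Clear all attached buffers at the start of the frame so the driver
 * can invalidate them instead of reading their old contents back. */
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
/* ... render the frame ... */
```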
I wanted to render multiple video streams using OpenGL. Currently I am doing this using glTexImage2D, provided by JOGL, and rendering into a Swing window.
For updating the texture contents for each video frame I am calling glTexImage2D. I want to know whether there is any faster method to update the texture without calling glTexImage2D for each frame.
You will always be using glTexImage2D, but with the difference that the data comes from a buffer object (a pixel buffer object, or PBO) rather than from a client pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it is done drawing the last frame during which it is still reading from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL can copy it!). Then it must copy the data and transfer it to the graphics card, and only then your application continues to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call (glBufferSubData) or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL will synchronize access to buffer objects, so it will not let you overwrite data while it is still reading from it. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the old contents and don't necessarily need the same buffer. So it will just allocate another one and give you that, keep reading from the old one, and secretly swap them when it's done.
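A sketch of that discard-initialize ("orphaning") pattern, assuming pbo, frameSize, width, height and pixels already exist:

```c
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

/* Once, at startup: reserve storage without supplying data. */
glBufferData(GL_PIXEL_UNPACK_BUFFER, frameSize, NULL, GL_STREAM_DRAW);

/* Every frame: orphan the old storage, then fill the new one. */
glBufferData(GL_PIXEL_UNPACK_BUFFER, frameSize, NULL, GL_STREAM_DRAW);
glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, frameSize, pixels);

/* The texture upload now sources from the bound PBO (offset 0),
 * so it returns without waiting for the copy to finish. */
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (void*)0);
```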
Since the word "synchronization" was used already, I shall explain my choice of glMapBufferRange in the link above, when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach: you create a single buffer twice the size you need, once, and then keep mapping and writing either the lower or the upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe to reuse a half (presumably by using a fence object), but you avoid as much overhead as possible.
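A sketch of that double-size, unsynchronized variant; the buffer is assumed to have been created with 2 * frameSize bytes, and the fence array and handling are deliberately simplified:

```c
#include <string.h> /* memcpy */

/* Alternate between the lower and upper half each frame. */
GLintptr offset = (frameIndex & 1) ? (GLintptr)frameSize : 0;

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
void *dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, offset, frameSize,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, pixels, frameSize);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (void*)offset);

/* It is now your job to know when a half is safe to reuse, e.g. one
 * fence per half, waited on (glClientWaitSync) before writing that
 * half again. */
fence[frameIndex & 1] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
```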
You can't update the texture without updating the texture.
Also, I don't think that one call to glTexImage can be a real performance problem. If you are oh so concerned about it though, create two textures and map one of them for writing while using the other for drawing, then swap (just like double-buffering works).
If you could move the processing to the GPU, you wouldn't have to call the function at all, which is about a 100% speedup.
The GLSL specification states, for the "coherent" memory qualifier: "memory variable where reads and writes are coherent with reads and writes from other shader invocations".
In practice, I'm unsure how this is interpreted by modern-day GPU drivers with regard to multiple rendering passes. When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes? For my purposes, I define a pass as a "glBindFramebuffer-glViewport-glUseProgram-glDrawABC-glDrawXYZ-glDraw123" cycle; I'm currently executing two such passes per "render loop iteration" but may have more per iteration later on.
When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes?
It means exactly what it says: "other shader invocations". It could be the same program code. It could be different code. It doesn't matter: shader invocations that aren't this one.
Normally, OpenGL handles synchronization for you, because OpenGL can track this fairly easily. If you map a range of a buffer object, modify it, and unmap it, OpenGL knows how much stuff you've (potentially) changed. If you use glTexSubImage2D, it knows how much stuff you changed. If you do transform feedback, it can know exactly how much data was written to the buffer.
If you do transform feedback into a buffer, then bind it as a source for vertex data, OpenGL knows that this will stall the pipeline. That it must wait until the transform feedback has completed, and then clear some caches in order to use the buffer for vertex data.
When you're dealing with image load/store, you lose all of this. Because so much could be written in a completely random, unknown, and unknowable fashion, OpenGL generally plays really loose with the rules in order to allow you flexibility to get the maximum possible performance. This triggers a lot of undefined behavior.
In general, the only rules you can follow are those outlined in section 2.11.13 of the OpenGL 4.2 specification. The biggest one (for shader-to-shader talk) is the rule on stages. If you're in a fragment shader, it is safe to assume that the vertex shader(s) that specifically computed the point/line/triangle for your fragment have completed. Therefore, you can freely load values that were stored by them. But only from the ones that made you.
Your shaders cannot make assumptions that shaders executed in previous rendering commands have completed (I know that sounds odd, given what was just said, but remember: "only from the ones that made you"). Your shaders cannot make assumptions that other invocations of the same shader, using the same uniforms, images, textures, etc, in the same rendering command have completed, except where the above applies.
The only thing you can assume is that writes your shader instance itself made are visible... to itself. So if you do an imageStore and do an imageLoad to the same memory location through the same variable, then you are guaranteed to get the same value back.
Well, unless someone else wrote to it in the meantime.
Your shaders cannot assume that a later rendering command will certainly fetch values written (via image store or atomic updates) by a previous one. No matter how much later! It doesn't matter what you've bound to the context. It doesn't matter what you've uploaded or downloaded (technically; odds are you'll get correct behavior in some cases, but undefined behavior is still undefined).
If you need that guarantee, if you need to issue a rendering command that will fetch values written by image stores or atomic updates, you must explicitly ask to synchronize memory sometime after issuing the writing call and before issuing the reading call. This is done with glMemoryBarrier.
Therefore, if you render something that does image storing, you cannot render something that uses the stored data until an appropriate barrier has been sent (either explicitly in the shader or explicitly in OpenGL code). This could be an image load operation. But it could be rendering from a buffer object written by shader code. It could be a texture fetch. It could be doing blending to an image attached to the FBO. It doesn't matter; you can't do it.
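As a sketch, assuming writerProgram does an imageStore() into tex and readerProgram samples that same texture (the programs and draw helpers are hypothetical):

```c
/* Pass 1: a shader writes to `tex` via imageStore(). */
glUseProgram(writerProgram);
glBindImageTexture(0, tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
drawWriterPass(); /* hypothetical draw call */

/* Make those image stores visible to subsequent texture fetches;
 * without this barrier, the reads below see undefined data. */
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

/* Pass 2: a shader samples `tex` with texture(). */
glUseProgram(readerProgram);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex);
drawReaderPass(); /* hypothetical draw call */
```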
Note that this applies for all operations that deal with image load/store/atomic, not just shader operations. So if you use image store to write to an image, you won't necessarily read the right data unless you use a GL_TEXTURE_UPDATE_BARRIER_BIT barrier.
I know this technically isn't supported (and as far as I can tell it's undefined behavior), but is it really a fatally horrible thing to sample from a texture which is also being written to?
I ask because I need to read from a depth texture which I also need to write to. If I can't do this, it means I will have to copy the depth texture; if that isn't a big deal, I don't see the harm in simply copying it.
Thanks for any help!
Yes, it's fatal and triggers undefined behavior. Just make a copy and read from the copy.
The explanation is simple. Since fragments are processed in parallel in an unspecified order, you might be reading texels that have already been written or texels that still hold their original values, and there is no way of knowing which you are getting. Making a copy and reading from it ensures that you will read the correct values.
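For the depth-texture case from the question, a sketch of the copy step, assuming two FBOs with the original and the copy attached as depth attachments (all names are assumptions):

```c
/* Copy the current depth into a second texture. Depth blits require
 * GL_NEAREST filtering. */
glBindFramebuffer(GL_READ_FRAMEBUFFER, sceneFbo);  /* depthTex attached */
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, copyFbo);   /* depthCopyTex attached */
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                  GL_DEPTH_BUFFER_BIT, GL_NEAREST);

/* Then render to sceneFbo's depth while sampling the copy: there is
 * no read/write feedback on the same texture. */
glBindFramebuffer(GL_FRAMEBUFFER, sceneFbo);
glBindTexture(GL_TEXTURE_2D, depthCopyTex);
```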
Matias and Goz covered the most important bits. Let me add a couple of interesting facts:
The Direct3D runtime actively unbinds textures when you bind their underlying resource as a render-target (so you can't create the cycle there).
UAVs in Direct3D 11 actually allow read-modify-write operations on a subset of the formats (the ones that do not require a type conversion). They do not guarantee any order of operation, though. This is notably what a number of order-independent transparency algorithms use (where the re-ordering is done manually).