I'm using sparse textures with OpenGL 4.4 on Win8.1 with the latest NV driver as of the date of writing. Everything seems to work fine with regular copies into the committed regions. However, when I try to do shader imageLoad/imageStore operations on a sparse texture (with mixed committed/uncommitted regions), the texture gets messed up all over the place (values look chaotic, like random memory content).
The extension spec (https://www.opengl.org/registry/specs/ARB/sparse_texture.txt) states that all shader and client-side reads from uncommitted regions are undefined and writes are discarded. However, I cannot find any explicit mention of imageLoad/imageStore anywhere. It does mention FBO attachments, though (which I want to avoid since I'm using compute shaders).
What's the proper behavior for sparse texture with regards to image load/store?
"Writes to such regions are ignored. The GL may attempt to write to uncommitted regions but the effect of doing so will be benign."
Evidently the definition of "benign" is up for debate unless you are discussing imageLoad (...). Attempting to store something will not produce random garbage, but reading is very clearly undefined:
"Reads from such regions produce undefined data, but otherwise have no adverse effect."
I would like to take this opportunity to point out, however, that GL_ARB_sparse_texture is functionally incomplete. Many of these things that are undefined in that extension are properly handled given a pair of supplemental extensions.
Think of this like Direct3D 11.2's tiled resources - there are multiple tiers of support depending on hardware capabilities. The ARB extension you are working with here is the minimally functional tier and the more advanced tier is implemented through the following two extensions:
GL_ARB_sparse_texture2
GL_ARB_sparse_texture_clamp
The scenarios you are discussing have well-defined behavior if you read up on Extension #1.
Overview
This extension builds on the ARB_sparse_texture extension, providing the
following new functionality:
New built-in GLSL texture lookup and image load functions are provided
that return information on whether the texels accessed for the texture
lookup accessed uncommitted texture memory.
New built-in GLSL texture lookup functions are provided that specify a
minimum level of detail to use for lookups where the level of detail
is computed automatically. This allows shaders to avoid accessing
unpopulated portions of high-resolution levels of detail when it knows
that the memory accessed is unpopulated, either from a priori
knowledge or from feedback provided by the return value of previously
executed "sparse" texture lookup functions.
Reads of uncommitted texture memory will act as though such memory
were filled with zeroes; previously, the values returned by reads were
undefined.
Extension #2 is probably of no interest to you since you are dealing with compute shaders.
Nowadays I'm hearing from different places about so-called GPU driven rendering, a new rendering paradigm which doesn't require draw calls at all and which is supported by the newer versions of the OpenGL and Vulkan APIs. Can someone explain how it actually works on a conceptual level and what the main differences from the traditional approach are?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
The "traditional" method is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and accesses some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc.), shaders, etc., transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.
We did all of that, off-loading these things to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU can now spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
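To make "parameters as data" concrete, here is a minimal sketch of what one indexed indirect draw looks like in memory. The field order follows the DrawElementsIndirectCommand layout given in the OpenGL specification; the helper make_command and its parameter names are hypothetical, made up purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One indexed indirect draw, as read from the buffer bound to
   GL_DRAW_INDIRECT_BUFFER by glDrawElementsIndirect and
   glMultiDrawElementsIndirect. */
typedef struct {
    uint32_t count;          /* number of indices to draw */
    uint32_t instanceCount;  /* number of instances; 0 draws nothing */
    uint32_t firstIndex;     /* offset into the index buffer, in indices */
    int32_t  baseVertex;     /* value added to every fetched index */
    uint32_t baseInstance;   /* first instance for instanced attributes */
} DrawElementsIndirectCommand;

/* The GL expects the five fields to be tightly packed. */
static_assert(sizeof(DrawElementsIndirectCommand) == 5 * sizeof(uint32_t),
              "indirect command must be 20 bytes");

/* Hypothetical helper: one non-instanced draw of a single object. */
static DrawElementsIndirectCommand make_command(uint32_t index_count,
                                                uint32_t first_index,
                                                int32_t base_vertex)
{
    DrawElementsIndirectCommand cmd;
    cmd.count = index_count;
    cmd.instanceCount = 1;
    cmd.firstIndex = first_index;
    cmd.baseVertex = base_vertex;
    cmd.baseInstance = 0;
    return cmd;
}
```

An array of these structs in a buffer object is what a multi-draw indirect call consumes; nothing in it has to have been written by the CPU.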
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism for this was to (ab)use transform feedback, but nowadays we just use SSBOs or, failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like:
Update the scene graph in GPU memory.
Issue one or more compute shaders that generate multi-draw indirect rendering commands.
Issue a single multi-draw indirect call that draws everything.
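The second step can be modeled on the CPU to show what such a compute pass has to produce. This is only a sketch under simplifying assumptions: visibility is a bounding-sphere test against a set of planes, every object is a single non-instanced draw, and all type and function names (Sphere, Plane, MeshRange, build_draw_commands) are hypothetical. On a GPU each loop iteration would be one compute shader invocation and the output cursor would be an atomic counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t count, instanceCount, firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
} DrawElementsIndirectCommand;

typedef struct { float nx, ny, nz, d; } Plane;    /* inside when n.p + d >= 0 */
typedef struct { float x, y, z, radius; } Sphere; /* per-object bounds */
typedef struct { uint32_t indexCount, firstIndex; int32_t baseVertex; } MeshRange;

/* A sphere is rejected only if it lies entirely behind some plane. */
static int sphere_visible(const Sphere *s, const Plane *planes, size_t planeCount)
{
    for (size_t i = 0; i < planeCount; ++i) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y
                   + planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius)
            return 0;
    }
    return 1;
}

/* Writes one command per visible object; returns how many were written. */
static uint32_t build_draw_commands(const Sphere *bounds, const MeshRange *meshes,
                                    uint32_t objectCount,
                                    const Plane *planes, size_t planeCount,
                                    DrawElementsIndirectCommand *out)
{
    uint32_t written = 0;
    assert(bounds != NULL && meshes != NULL && out != NULL);
    for (uint32_t i = 0; i < objectCount; ++i) {
        if (!sphere_visible(&bounds[i], planes, planeCount))
            continue;
        DrawElementsIndirectCommand cmd = {
            meshes[i].indexCount, 1u, meshes[i].firstIndex,
            meshes[i].baseVertex, 0u
        };
        out[written++] = cmd;
    }
    return written;
}
```

Note that the returned count also lives on the GPU in the real thing. Having the draw call read it from a buffer needs ARB_indirect_parameters or OpenGL 4.6; a common alternative is to emit a command for every object and set instanceCount to 0 for the culled ones.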
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers and textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
But the point of this exercise isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with regard to how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style, where you have a very small number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations. Which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pull from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
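As a sketch of the bookkeeping this implies, the function below lays meshes out back to back in one shared vertex buffer and one shared index buffer, and records for each mesh the firstIndex and baseVertex values a sub-draw would later use. The names (MeshSize, MeshRange, pack_meshes) are hypothetical, and it assumes every mesh uses the same vertex format and index type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t vertexCount, indexCount; } MeshSize;

/* Where one mesh ended up inside the shared buffers. */
typedef struct {
    uint32_t firstIndex;   /* first index of the mesh, in indices */
    int32_t  baseVertex;   /* first vertex of the mesh, in vertices */
    uint32_t indexCount;
} MeshRange;

/* Assigns consecutive ranges and reports the total buffer sizes needed
   (in vertices and indices, not bytes). */
static void pack_meshes(const MeshSize *sizes, size_t meshCount,
                        MeshRange *ranges,
                        uint32_t *totalVertices, uint32_t *totalIndices)
{
    uint32_t vertexCursor = 0, indexCursor = 0;
    assert(sizes != NULL && ranges != NULL);
    assert(totalVertices != NULL && totalIndices != NULL);
    for (size_t i = 0; i < meshCount; ++i) {
        ranges[i].firstIndex = indexCursor;
        ranges[i].baseVertex = (int32_t)vertexCursor;
        ranges[i].indexCount = sizes[i].indexCount;
        vertexCursor += sizes[i].vertexCount;
        indexCursor  += sizes[i].indexCount;
    }
    *totalVertices = vertexCursor;
    *totalIndices  = indexCursor;
}
```

Because baseVertex is added to each fetched index, every mesh can keep indices that start at zero; only the ranges differ per sub-draw.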
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
This also means that the compute shader which generates this sub-draw also needs to use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
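The pairing can be sketched in plain C as two parallel arrays that are always written at the same slot. Everything here is a hypothetical model: PerObjectData stands in for whatever your SSBO holds, emit_draw plays the role of the compute shader, and fetch_per_object plays the role of the vertex shader indexing with gl_DrawID (which needs OpenGL 4.6 or ARB_shader_draw_parameters).

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t count, instanceCount, firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
} DrawElementsIndirectCommand;

/* Hypothetical per-object record; a real one follows your shader's layout. */
typedef struct {
    float    model[16];     /* object-to-world matrix */
    uint32_t textureLayer;  /* which texture this object samples */
} PerObjectData;

typedef struct {
    DrawElementsIndirectCommand *commands;  /* indirect buffer contents */
    PerObjectData *perObject;               /* SSBO contents */
    uint32_t count, capacity;
} DrawList;

/* Producer side: command and per-object data share one slot index.
   Returns the slot, or -1 if the list is full. */
static int emit_draw(DrawList *list, DrawElementsIndirectCommand cmd,
                     const PerObjectData *data)
{
    uint32_t slot;
    if (list->count == list->capacity)
        return -1;
    slot = list->count++;
    list->commands[slot]  = cmd;
    list->perObject[slot] = *data;
    return (int)slot;
}

/* Consumer side: what the vertex shader does with its draw index. */
static const PerObjectData *fetch_per_object(const DrawList *list, uint32_t drawId)
{
    assert(drawId < list->count);
    return &list->perObject[drawId];
}
```

As long as both arrays are only ever written through the same slot, sub-draw N of the multi-draw always finds its own data at index N.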
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
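Those restrictions are easy to check up front. The sketch below assumes a hypothetical TextureDesc record and a made-up PixelFormat enum; max_layers is meant to be whatever your implementation reports for GL_MAX_ARRAY_TEXTURE_LAYERS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format tags, not real GL enum values. */
typedef enum { FORMAT_RGBA8, FORMAT_RGB8, FORMAT_RGBA16F } PixelFormat;

typedef struct {
    uint32_t    width, height;
    PixelFormat format;
} TextureDesc;

/* Returns 1 if all textures could live as layers of one array texture:
   same size, same format, and no more layers than the limit allows. */
static int can_share_array_texture(const TextureDesc *textures, size_t count,
                                   size_t max_layers)
{
    assert(textures != NULL || count == 0);
    if (count == 0 || count > max_layers)
        return 0;
    for (size_t i = 1; i < count; ++i) {
        if (textures[i].width  != textures[0].width  ||
            textures[i].height != textures[0].height ||
            textures[i].format != textures[0].format)
            return 0;
    }
    return 1;
}
```

In practice you would probably bucket textures by (size, format) and build one array texture per bucket, which is exactly the kind of fragmentation the alternatives below avoid.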
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.
Where can I see a list of OpenGL commands like glBindXXX sorted by execution cost?
For example, such a list should answer questions like:
Which costs more: changing a texture or a shader?
Which costs more: changing a shader or a vertex buffer?
etc.
As @datenwolf wrote, it's highly dependent on implementation/hardware, but here's a link to a presentation from 2014 that has a table of relative costs in decreasing order (page 48):
http://www.slideshare.net/CassEveritt/beyond-porting
Render target > Program > ROP > Texture binding > Vertex format > UBO > Vertex bindings > Uniform updates
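One common way to act on an ordering like this (not something the presentation prescribes, just a sketch) is to pack the state IDs of each draw into a sort key with the most expensive state in the highest bits, then sort draws by key so that expensive state changes happen least often. The DrawState fields and the 8-bit ID width are assumptions for illustration, and the ordering itself is hardware-dependent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-draw state IDs, at most 256 distinct values each. */
typedef struct {
    uint8_t renderTarget, program, rop, texture;
    uint8_t vertexFormat, ubo, vertexBinding;
} DrawState;

/* Most expensive state goes in the most significant bits. */
static uint64_t make_sort_key(const DrawState *s)
{
    return ((uint64_t)s->renderTarget << 48) |
           ((uint64_t)s->program      << 40) |
           ((uint64_t)s->rop          << 32) |
           ((uint64_t)s->texture      << 24) |
           ((uint64_t)s->vertexFormat << 16) |
           ((uint64_t)s->ubo          <<  8) |
            (uint64_t)s->vertexBinding;
}

static int compare_keys(const void *a, const void *b)
{
    uint64_t ka = *(const uint64_t *)a;
    uint64_t kb = *(const uint64_t *)b;
    return (ka > kb) - (ka < kb);
}

/* Sorts keys ascending, grouping draws that share expensive state. */
static void sort_draw_keys(uint64_t *keys, size_t count)
{
    assert(keys != NULL || count == 0);
    qsort(keys, count, sizeof keys[0], compare_keys);
}
```

After sorting, draws with the same render target are adjacent, within those the same program, and so on down the list. Measure on your own targets before trusting any particular order.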
Where can I see a list of OpenGL commands like glBindXXX sorted by execution cost?
Nowhere, because such a list doesn't exist. OpenGL is just a specification, and every implementation may behave completely differently from every other implementation.
And the costs of state changes depend entirely on the actual implementation. That being said there are a few rules of thumb:
Operations that invalidate ("cool down") caches are the most expensive ones to carry out. So switching a texture (and then using it for actual drawing) is quite costly; just binding a different texture and then binding another one without doing anything with the texture, however, may or may not be cheap.
Note that some OpenGL implementations (notably the proprietary AMD and NVidia ones) even go as far as collecting statistics and runtime profiles of the process calling into them to apply heuristics to optimize the runtime behavior.
The OpenGL tradition is to let the user manipulate OpenGL objects using an unsigned int handle. Why not just give a pointer instead? What are the advantages of unique IDs over pointers?
TL;DR: OpenGL IDs don't map bijectively to memory locations. A single OpenGL ID may refer to multiple memory locations at the same time. Also OpenGL has been designed to work for distributed rendering architectures (like X11) as well, and given an indirect context programs running on different machines may use the same OpenGL context.
OpenGL has been designed as an architecture and display system agnostic API. When OpenGL was first developed this happened in light of client-server display architectures (like X11). If you look into the OpenGL specification, even of modern OpenGL-4 it refers to clients and servers.
However, in a client/server architecture pointers make no sense. For one, the address space of the server is not accessible to the clients without jumping through some hoops. And even if you set up a shared memory mapping, the addresses of objects are not the same for client and server. Add to this that on architectures like X11 a single indirect OpenGL context can be used by multiple clients, which may even run on different machines. Pointers simply don't work for that.
Last but not least, the OpenGL object model is highly abstract and the OpenGL drawing model is asynchronous. Say I do the following:
id = glGenTextures(1)
glBindTexture(id)
glTexStorage(…)
glTexSubImage(image_a)
draw_something()
glTexSubImage(image_b)
draw_something_b()
When the end of this little snippet has been reached, nothing at all may actually have been drawn yet, because no synchronization point has been reached (glFinish, glReadPixels, a buffer swap). Note the two calls to glTexSubImage, which happen on the same id. When the pixels are finally put to the framebuffer, there are two different images to be sourced from a single texture ID, because OpenGL guarantees that things will appear as if they were drawn synchronously. So at the end of a drawing batch a single object ID may refer to a whole collection of different data sets with different locations in memory.
My first consideration - having pointers would make programmers wonder if they can operate with them in a pointer-arithmetic way, e.g. by pointing to the middle of a texture to update it or something like that. Maybe even crazier things, such as patching shader code on-the-fly. That all sounds like a whole new cool degree of freedom, until you think of the additional complications caused by tampering with the highly efficient and optimized "black-box" way a GPU operates.
For example - consider the inner workings of GPU memory allocation. Just like with an OS - the pointers you get from the OS are not the real "physical" ones, and the OS memory manager can move things around behind the scenes while keeping the pointers the same (e.g. swapping to HDD). In that case IDs are just the same - the GPU can optimize and pack entities with even more freedom, while keeping the nice facade of them being available at 1-2-3.
Another example - OpenGL is not actually the same across manufacturers. In fact OpenGL is just a description of an API, where each vendor can make his own implementation the way it works best for him. For example there's no rule on how to store texture mipmaps: aligned, interleaved, or whatever. Having pointers to a texture would lure developers into tampering with mipmaps, which would cause a lot of trouble to support various implementations or force all the implementations to become strictly unified, which again is a bad idea for performance.
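A toy version of that facade, to make the point concrete. This is not how any driver actually works; HandleTable and its functions are hypothetical. The caller only ever holds a small integer, and the table is free to move the backing storage without the caller noticing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OBJECTS 64u

typedef struct { void *storage; size_t size; } Slot;
typedef struct { Slot slots[MAX_OBJECTS]; uint32_t used; } HandleTable;

/* Returns a non-zero ID, or 0 on failure (0 is never a valid ID here). */
static uint32_t handle_create(HandleTable *t, size_t size)
{
    void *mem;
    if (t->used == MAX_OBJECTS || size == 0)
        return 0;
    mem = calloc(1, size);
    if (mem == NULL)
        return 0;
    t->slots[t->used].storage = mem;
    t->slots[t->used].size = size;
    return ++t->used;                 /* IDs start at 1 */
}

/* Internal lookup: ID to current storage location. */
static void *handle_resolve(const HandleTable *t, uint32_t id)
{
    assert(id >= 1 && id <= t->used);
    return t->slots[id - 1].storage;
}

/* Moves the backing storage elsewhere; the ID stays the same. */
static int handle_relocate(HandleTable *t, uint32_t id)
{
    Slot *slot;
    void *moved;
    assert(id >= 1 && id <= t->used);
    slot = &t->slots[id - 1];
    moved = malloc(slot->size);
    if (moved == NULL)
        return -1;
    memcpy(moved, slot->storage, slot->size);
    free(slot->storage);
    slot->storage = moved;
    return 0;
}

/* Frees all storage owned by the table. */
static void handle_table_destroy(HandleTable *t)
{
    for (uint32_t i = 0; i < t->used; ++i)
        free(t->slots[i].storage);
    t->used = 0;
}
```

If the API had handed out the raw pointer instead, handle_relocate would have left every caller holding a dangling address.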
The OpenGL device (GPU) may have its own memory with its own address space, independent of the host (CPU) memory system. (Think of a discrete video card with its own onboard RAM.) The host can't (directly) access that memory, so it's not possible to have a pointer to it.
It's best to think of the GPU as a whole separate computer; it's actually possible to do OpenGL over a network, with a program running on one computer rendering graphics on the video card in another. When you set up your textures and buffers, you're basically uploading data to the GL device for its own internal use.
ARB_texture_storage was introduced into OpenGL 4.2 core.
Can you explain what immutability for texture objects means?
Why is it better than the previous texture usage, and what are the disadvantages of this feature?
I know I can read the spec of this extension (which I did :)), but I would like to see some examples or other explanation.
Just read the introduction from the extension itself:
The texture image specification commands in OpenGL allow each level
to be separately specified with different sizes, formats, types and
so on, and only imposes consistency checks at draw time. This adds
overhead for implementations.
This extension provides a mechanism for specifying the entire
structure of a texture in a single call, allowing certain
consistency checks and memory allocations to be done up front. Once
specified, the format and dimensions of the image array become
immutable, to simplify completeness checks in the implementation.
When using this extension, it is no longer possible to supply texture
data using TexImage*. Instead, data can be uploaded using TexSubImage*,
or produced by other means (such as render-to-texture, mipmap generation,
or rendering to a sibling EGLImage).
This extension has complicated interactions with other extensions.
The goal of most of these interactions is to ensure that a texture
is always mipmap complete (and cube complete for cubemap textures).
The obvious advantages are that the implementation can remove completeness / consistency checks at runtime, and your code is more robust because you can't accidentally create a wrong texture.
To elaborate: "immutable" here means that the texture storage (one of the three components of a texture: storage, sampling, parameters) gets allocated once and it's already complete. Note that storage doesn't mean the storage contents -- they can change at any time; it refers to the logical process of acquiring resources for those contents (like, a malloc).
With non-immutable textures, you can change the storage at any time, by means of glTexImage<N>D calls. There are many many ways of shooting yourself in the foot this way:
you may create mipmap-incomplete textures (probably the most common newbie error with textures: GL_TEXTURE_MAX_LEVEL defaults to 1000 and the default minification filter uses mipmaps, yet people upload only one image)
you may create textures with different formats in different mipmap levels (illegal)
you may create cubemap-incomplete cubemaps (illegal)
you may create cubemaps with different formats in different faces (illegal)
Since you're allowed to call glTexImage<N>D at any time, the implementation must always check, at draw time, that your texture is legal. Immutable storage always does the right thing for you by allocating everything in one go (all mipmap levels, all cubemap faces, etc.) with the right format. So you can't screw up a texture (easily) any more, and the implementation can remove some checks, which speeds things up. And everybody is happy :)
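Since immutable storage wants the number of levels up front, it helps to see how that number falls out of the texture size. The sketch below computes the length of a full mipmap chain for a 2D texture (floor(log2(max(width, height))) + 1) and the size of each level; the function names are hypothetical. The level count is the kind of value you would pass as the levels argument of glTexStorage2D.

```c
#include <assert.h>
#include <stdint.h>

/* Number of levels in a complete mipmap chain for a width x height
   texture, counting the base level. */
static uint32_t full_mip_level_count(uint32_t width, uint32_t height)
{
    uint32_t largest = width > height ? width : height;
    uint32_t levels = 0;
    assert(width > 0 && height > 0);
    while (largest > 0) {       /* one level per halving, down to 1 */
        ++levels;
        largest >>= 1;
    }
    return levels;
}

/* Size of one dimension at a given level: halve per level, round down,
   but never go below 1. */
static uint32_t mip_dimension(uint32_t base, uint32_t level)
{
    uint32_t d;
    assert(level < 32);
    d = base >> level;
    return d > 0 ? d : 1;
}
```

For a 640x480 texture this gives 10 levels, with the last one being 1x1. With the old glTexImage2D path you would have had to upload (or generate) each of those levels yourself before the texture became mipmap complete.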
The GLSL specification states, for the "coherent" memory qualifier: "memory variable where reads and writes are coherent with reads and writes from other shader invocations".
In practice, I'm unsure how this is interpreted by modern-day GPU drivers with regard to multiple rendering passes. When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes? For my purposes, I define a pass as a "glBindFramebuffer-glViewport-glUseProgram-glDrawABC-glDrawXYZ-glDraw123" cycle; I'm currently executing 2 such passes per "render loop iteration" but may have more per iteration later on.
When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes?
It means exactly what it says: "other shader invocations". It could be the same program code. It could be different code. It doesn't matter: shader invocations that aren't this one.
Normally, OpenGL handles synchronization for you, because OpenGL can track this fairly easily. If you map a range of a buffer object, modify it, and unmap it, OpenGL knows how much stuff you've (potentially) changed. If you use glTexSubImage2D, it knows how much stuff you changed. If you do transform feedback, it can know exactly how much data was written to the buffer.
If you do transform feedback into a buffer, then bind it as a source for vertex data, OpenGL knows that this will stall the pipeline. That it must wait until the transform feedback has completed, and then clear some caches in order to use the buffer for vertex data.
When you're dealing with image load/store, you lose all of this. Because so much could be written in a completely random, unknown, and unknowable fashion, OpenGL generally plays really loose with the rules in order to allow you flexibility to get the maximum possible performance. This triggers a lot of undefined behavior.
In general, the only rules you can follow are those outlined in section 2.11.13 of the OpenGL 4.2 specification. The biggest one (for shader-to-shader talk) is the rule on stages. If you're in a fragment shader, it is safe to assume that the vertex shader(s) that specifically computed the point/line/triangle for your fragment have completed. Therefore, you can freely load values that were stored by them. But only from the ones that made you.
Your shaders cannot make assumptions that shaders executed in previous rendering commands have completed (I know that sounds odd, given what was just said, but remember: "only from the ones that made you"). Your shaders cannot make assumptions that other invocations of the same shader, using the same uniforms, images, textures, etc, in the same rendering command have completed, except where the above applies.
The only thing you can assume is that writes your shader instance itself made are visible... to itself. So if you do an imageStore and do an imageLoad to the same memory location through the same variable, then you are guaranteed to get the same value back.
Well, unless someone else wrote to it in the meantime.
Your shaders cannot assume that a later rendering command will certainly fetch values written (via image store or atomic updates) by a previous one. No matter how much later! It doesn't matter what you've bound to the context. It doesn't matter what you've uploaded or downloaded (technically. Odds are you'll get correct behavior in some cases, but undefined behavior is still undefined).
If you need that guarantee, if you need to issue a rendering command that will fetch values written by image store/atomic updates, you must explicitly ask to synchronize memory sometime after issuing the writing call and before issuing the reading call. This is done with glMemoryBarrier.
Therefore, if you render something that does image storing, you cannot render something that uses the stored data until an appropriate barrier has been sent (either explicitly in the shader or explicitly in OpenGL code). This could be an image load operation. But it could be rendering from a buffer object written by shader code. It could be a texture fetch. It could be doing blending to an image attached to the FBO. It doesn't matter; you can't do it.
Note that this applies for all operations that deal with image load/store/atomic, not just shader operations. So if you use image store to write to an image, you won't necessarily read the right data unless you use a GL_TEXTURE_UPDATE_BARRIER_BIT barrier.