When do I use the STD140 for uniform blocks in OpenGL?
Although I am not a 100% sure, I believe there is an alternative to it which can achieve the same thing, called "Shared".
Is it just preference for the coder? Or are there reasons to use one over the other?
Uniform buffer objects are described in http://www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt
The data storage for a uniform block can be declared to use one of three layouts in memory: packed, shared, or std140.
packed uniform blocks have an implementation-dependent data layout for efficiency, and unused uniforms may be eliminated by the compiler to save space.
shared uniform blocks, the default layout, have an implementation-dependent data layout for efficiency, but the layout will be uniquely determined by the structure of the block, allowing data storage to be shared across programs.
std140 uniform blocks have a standard cross-platform cross-vendor layout. Unused uniforms will not be eliminated.
The std140 uniform block layout, which guarantees specific packing behavior and does not require the application to query for offsets and strides. In this case, the minimum size may still be queried, even though it is determined in advance based only on the uniform block declaration. The offset of each uniform in a uniform block can be derived from the definition of the uniform block by applying the set of rules described in the OpenGL specification.
std140 is most useful when you have a uniform block that you update all at once, for example a collection of matrix and lighting values for rendering a scene. Declare the block with std140 in your shader(s), and you can replicate the memory layout in C with a struct. Instead of having to query and save the offsets for every individual value within the block from C, you can just glBufferData(GL_UNIFORM_BUFFER, sizeof(my_struct), &my_struct, with one call.
You do need to be a little careful with alignment in C, for instance, a vec3 will take up 4 floats, not 3, but it is still much easier IMHO.
Related
One of the inputs of my fragment shader is an array of 5 structures. The shader computes a color based on each of the 5 structures. In the end, these 5 colors are summed together to produce the final output. The total size of the array is 1440 bytes. To accommodate the alignment of the uniform buffer, the size of the uniform buffer changes to 1920 bytes.
1- If I define the array of 5 structures as a uniform buffer array, the rendering takes 5ms (measured by Nsight Graphics). The uniform buffer's memory property is 'VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT'. The uniform buffer in glsl is defined as follows
layout(set=0,binding=0) uniform UniformStruct { A a; } us[];
layout(location=0) out vec4 c;
void main()
{
vec4 col = vec4(0);
for (int i = 0; i < 5; i++)
col += func(us[nonuniformEXT(i)]);
c = col;
}
Besides, I'm using 'GL_EXT_nonuniform_qualifier' extension to access the uniform buffer array. This seems the most straightforward way for me but there are alternative implementations.
2- I can split the rendering from one vkCmdDraw to five vkCmdDraw, change the framebuffer's blend mode from overwriting to addition and define a uniform buffer instead of a uniform buffer array in the fragment shader. On the CPU side, I change the descriptor type from UNIFORM_BUFFER to UNIFORM_BUFFER_DYNAMICS. Before each vkCmdDraw, I bind the dynamic uniform buffer and the corresponding offsets. In the fragment shader, the for loop is removed. Although it seems that it should be slower than the first method, it is surprisingly much faster than the first method. The rendering only takes 2ms total for 5 draws.
3- If I define the array of 5 structures as a storage buffer and do one vkCmdDraw, the rendering takes only 1.4ms. In other words, if I change the array from the uniform buffer array to storage buffer but keep anything else the same as 1, it becomes faster.
4- If I define the array of 5 structures as a global constant in the glsl and do one vkCmdDraw, the rendering takes only 0.5ms.
In my opinion, 4 should be the fastest way, which is true in the test. Then 1 should be the next. Both 2 and 3 should be slower than 1. However, Neither 2 or 3 is slower than 1. In contrast, they are much faster than 1. Any ideas why using uniform buffer array slows down the rendering? Is it because it is a host visible buffer?
When it comes to UBOs, there are two kinds of hardware: the kind where UBOs are specialized hardware and the kind where they aren't. For GPUs where UBOs are not specialized hardware, a UBO is really just a readonly SSBO. You can usually tell the difference because hardware where UBOs are specialized will have different size limits on them from those of SSBOs.
For specialized hardware-based UBOs (which NVIDIA still uses, if I recall correctly), each UBO represents an upload from memory into a big block of constant data that all invocations of a particular shader stage can access.
For this kind of hardware, an array of UBOs is basically creating an array out of segments of this block of constant data. And some hardware has multiple blocks of constant data, so indexing then with non-constant expressions is tricky. This is why non-constant access to such indices is an optional feature of Vulkan.
By contrast, a UBO which contains an array is just one big UBO. It's special only in how big it is. Indexing through an array within a UBO is no different from indexing any array. There are no special rules with regard to the uniformity of the index of such accesses.
So stop using an array of UBOs and just use a single UBO which contains an array of data:
layout(set=0,binding=0) uniform UniformStruct { A a[5]; } us;
It'll also avoid additional padding due to alignment, additional descriptors, additional buffers, etc.
However, you might also speed things up by not lying to Vulkan. The expression nonuniformEXT(i) states that the expression i is not dynamically uniform. This is incorrect. Every shader invocation that executes this loop will generate i expressions that have values from 0 to 4. Every dynamic instance of the expression i for any invocation will have the same value at that place in the code as every other.
Therefore i is dynamically uniform, so telling Vulkan that it isn't is not helpful.
I've recently replaced all usage of glUniform() in my code to instead use Uniform Buffer Objects (UBO) to better organize my object rendering code. However, my performance has sharply plummeted as the number of objects drawn increases (an object in my case is anything with it's own offset into a UBO; only one UBO is shared between everything that uses the same shader program).
I think this may be because my objects actually use very little uniform data, sometimes as little as one vec4, while a UBO requires each offset be in multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT which can be rather large.
I understood UBO to be a strict upgrade to using glUniform(), am I mistaken? Should I have a glUniform() path for these simple objects, or is there a necessary step to using UBOs effectively with many different simple objects that I haven't taken?
My scene is mostly quads, with very few objects using more than 4 verts, 6 indices. Some of these quads have large uniform blocks while most have very simple ones, maybe one or two vec4.
(For now, I've gotten around this problem by embedding this object data into the vertices, though it definitely feels wrong to do it this way.)
For example, I have code like this:
uniform struct MyStruct {
mat4 model;
mat4 view;
mat4 projection;
float f1;
vec4 v1;
}, myStructs[4];
Can I be sure that location of myStructs[1].projection is location of myStructs[0].projection + 5?
I didn't find the exact information about this on khronos.org, but I found some blurry statement:
struct Thingy
{
vec4 an_array[3];
int foo;
};
layout(location = 2) uniform Thingy some_thingies[6];
Each Thingy takes up 4 uniform locations; the first three going to
an_array and the fourth going to foo. Thus, some_thingies takes up 24
uniform locations.
It isn't clear here whether locations one after another. Perhaps about this somewhere is said more accurately?
Unless you explicitly specify the location of the uniform variable, the locations of arrays of non-basic types are not strictly defined, relative to the location of any particular member of that array. So you must query the location of every member of every array/struct that you use.
Or just explicitly specify the location with layout(location). That's a much easier option; those are explicitly required to allocate their locations sequentially. And for bonus points, you don't have to query anything.
Your first example uniform struct MyStruct is a raw uniform whose member locations are arbitrary and must be queried:
Uniform locations are unique to a specific program. If you do not explicitly assign a uniform to a location (via the OpenGL 4.3 or ARB_explicit_uniform_location feature mentioned above), then OpenGL will assign them arbitrarily.
Your second example layout(location = 2) uniform Thingy some_thingies[6]; is defining a uniform block to which the following memory layout applies:
Quote Memory layout:
The specific size of basic types used by members of buffer-backed blocks is defined by OpenGL. However, implementations are allowed some latitude when assigning padding between members, as well as reasonable freedom to optimize away unused members. How much freedom implementations are allowed for specific blocks can be changed.
There are four memory layout qualifiers: shared, packed, std140, and std430. Defaults can be set the same as for matrix ordering (eg: layout(packed) buffer; sets all shader storage buffer blocks to use packed). The default is shared.
So it seems that they are sequential in memory, but as t.niese points out: only std140 and std430 provide you with those guarantees (note that std430 can only be used with shader storage blocks, not uniform blocks). Since the default layout is shared some parts of your uniform might have been optimised out or padded differently, depending on your driver.
Use glGetUniformLocation to query each location of the members separately:
Uniform variables that are structures or arrays of structures may be queried by calling glGetUniformLocation for each field within the structure. The array element operator "[]" and the structure field operator "." may be used in name in order to select elements within an array or fields within a structure. The result of using these operators is not allowed to be another structure, an array of structures, or a subcomponent of a vector or a matrix. Except if the last part of name indicates a uniform variable array, the location of the first element of an array can be retrieved by using the name of the array, or by using the name appended by "[0]".
Is it guaranteed that if a uniform block is declared the same in multiple shader programs, say
uniform Matrices
{
mat4 ProjectionMatrix;
mat4 CameraMatrix;
mat4 ModelMatrix;
};
Will it have the same block index returned by glGetUniformBlockIndex(program, "Matrices")?
If the answer is yes, then I'm able to query the index of the block once and use it for all the shader programs that contain that block, right?
Second question: will ProjectionMatrix, CameraMatrix, ModelMatrix, always have the same layout order in memory, respectively? I'm asking this because the tutorial I read uses the next functions
// Query for the offsets of each block variable
const GLchar *names[] = { "InnerColor", "OuterColor",
"RadiusInner", "RadiusOuter" };
GLuint indices[4];
glGetUniformIndices(programHandle, 4, names, indices);
GLint offset[4];
glGetActiveUniformsiv(programHandle, 4, indices,
GL_UNIFORM_OFFSET, offset);
And I'm not sure if that's really needed, as long as I know the uniforms order inside the uniform block..?
will ProjectionMatrix, CameraMatrix, ModelMatrix, always have the same layout order in memory, respectively?
No. Here's what the standard states (emphasis mine):
If pname is UNIFORM_BLOCK_DATA_SIZE, then the implementation-
dependent minimum total buffer object size, in basic machine units, required to hold all active uniforms in the uniform block identified by uniformBlockIndex is returned. It is neither guaranteed nor expected that a given implementation will arrange uniform values as tightly packed in a buffer object. The exception to this is the std140 uniform block layout, which guarantees specific packing behavior and does not require the application to query for offsets and strides.
I'm not sure if that's really needed, as long as I know the uniforms order inside the uniform block..?
So, yes, the author is right in not assuming the layout is contiguous and does what's sensible (guaranteed to work always in all implementations): gets the uniform indices and assigns their values respectively.
Specifying layout(std140) will do the trick then, right?
Yes, you can avoid querying the location and uploading data every time by making use of both uniform buffer objects and std140. However, make sure you understand its alignment requirements. This information is detailed in ARB_uniform_buffer_object's specification. For an elaborate treatment with examples see OpenTK's article on Uniform Buffer Objects (UBO) using the std140 layout specification.
Is it guaranteed that if a uniform block is declared the same in multiple shader programs, will it have the same block index returned by glGetUniformBlockIndex(program, "Matrices")?
No. I've searched the OpenGL 3.3 specification which gives no such guarantees. From the standard's viewpoint, uniform blocks (default or named) are associated to a program, period. No existence/association of uniform blocks beyond a program is made in the specification.
Because there is no guarantee that uniform blocks will have the same index in different shader program, that means I need to call glBindBufferBase() everytime I switch programs, right?
Yes, see ARB_uniform_buffer_object's specification for an example.
I want to render a scene to texture, and share the texture sampler in several programs. Similar to share the project view matrix in multiple programs. Unlike project view matrix, which can be put into uniform blocks, "Samplers cannot be part of uniform blocks". https://www.opengl.org/wiki/Uniform_(GLSL)
Some discussion here describes why:
multiple issues:
GLSL samplers embody two things: source of texture data and how to filter from it (nearest, linear, mipmapping, anisotropic, etc)
Samples are opaque things, and to place them into a uniform buffer object requires they have a well defined (and cross-platform) size.
Those issues together make it dicey.
I want to see some explanation of why this cannot be handled and any other ways to achieve sharing texture samplers.
The explanation is right here:
Samples are opaque things, and to place them into a uniform buffer object requires they have a well defined (and cross-platform) size.
The purpose of uniform blocks is, that you can set then with a single OpenGL call from a single structure in your (client) program. But for this to work the memory layout of this structure must be known, so that the memory layout produced by the compiler matches that of the shader uniform block.
But since the memory layout of samplers is not defined a memory layout can not be determined. Without a definitive memory layout no structure, without a structure no block.