GLSL packing 4 float attributes into vec4 - opengl

I have a question about resource consumption of attribute float in glsl.
Does it take as many resources as vec4, or no?
I ask this because uniforms can take as much space as a vec4 (at least, they could): https://stackoverflow.com/a/20775024/1559666
If a float attribute is not cheaper than a vec4, does it make any sense to pack 4 floats into one vec4 attribute?

Yes, all vertex attributes require some multiple of a 4-component vector for storage.
This means that a float vertex attribute takes 1 slot, the same as a vec2, vec3 or vec4 would, and types larger than vec4 take multiple slots. A mat4 vertex attribute takes 4 vec4-sized slots, and a dvec4 (double-precision vector) vertex attribute takes 2. Since implementations are only required to offer 16 unique vertex attribute slots, if you naively used single float attributes, a 4x4 matrix stored as 16 separate floats would exhaust all available storage by itself.
There is no getting around this. Unlike uniforms (scalar GPUs may be able to store float uniforms more efficiently than vec4), attributes are always tied to a 4-component data type. So for vertex attributes, packing attributes into vectors is quite important.
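As a back-of-the-envelope illustration of the slot rules described above, here is a rough Python sketch (not GL code; the table of costs is an assumption following the "one 4-component slot per scalar/vector, one slot per matrix column" rule):

```python
# Assumed slot cost per GLSL attribute type: every scalar/vector
# occupies one 4-component slot, a matrix occupies one slot per column.
SLOTS = {
    "float": 1, "vec2": 1, "vec3": 1, "vec4": 1,
    "mat2": 2, "mat3": 3, "mat4": 4,
}

def slots_used(attribute_types):
    """Total vertex attribute slots consumed by a list of inputs."""
    return sum(SLOTS[t] for t in attribute_types)

# Four separate floats waste 4 of the (typically 16) slots...
print(slots_used(["float"] * 4))  # 4
# ...while one packed vec4 uses only 1.
print(slots_used(["vec4"]))       # 1
```

This is why packing four scalar attributes into one vec4 recovers three slots.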
I have updated my answer to point out relevant excerpts from the GL and GLSL specifications:
OpenGL 4.4 Core Profile Specification - 10.2.1 Current Generic Attributes - pp. 307
Vertex shaders (see section 11.1) access an array of 4-component generic vertex
attributes. The first slot of this array is numbered zero, and the size of the array is
specified by the implementation-dependent constant GL_MAX_VERTEX_ATTRIBS.
GLSL 4.40 Specification - 4.4.1 Input Layout Qualifiers - pp. 60
If a vertex shader input is any scalar or vector type, it will consume a single location. If a non-vertex shader input is a scalar or vector type other than dvec3 or dvec4, it will consume a single location, while types dvec3 or dvec4 will consume two consecutive locations. Inputs of type double and dvec2 will consume only a single location, in all stages.
Admittedly, the behavior described for dvec4 differs slightly. In GL_ARB_vertex_attrib_64bit form, double-precision types may consume twice as much storage as floating-point, such that a dvec3 or dvec4 may consume two attribute slots. When it was promoted to core, that behavior changed... they are only supposed to consume 1 location in the vertex stage, potentially more in any other stage.
Original (extension) behaviour of double-precision vector types:
Name
ARB_vertex_attrib_64bit
[...]
Additionally, some vertex shader inputs using the wider 64-bit components
may count double against the implementation-dependent limit on the number
of vertex shader attribute vectors. A 64-bit scalar or a two-component
vector consumes only a single generic vertex attribute; three- and
four-component "long" may count as two. This approach is similar to the
one used in the current GL where matrix attributes consume multiple
attributes.

In buffer memory, a vec4 attribute will take 4 times the space of a float attribute.
With uniforms, due to alignment rules, you may lose some components (in std140 layout, a vec4 is aligned to 16 bytes).
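To make the "lost components" concrete, here is a simplified sketch of the std140 alignment rules, covering only scalars and float vectors (the full rules also handle arrays, matrices and structs):

```python
# std140 base alignments and sizes in bytes, scalars/float vectors only.
BASE_ALIGNMENT = {"float": 4, "vec2": 8, "vec3": 16, "vec4": 16}
SIZE = {"float": 4, "vec2": 8, "vec3": 12, "vec4": 16}

def std140_offsets(members):
    """Return (offset, size) for each member, inserting padding
    so every member starts at a multiple of its base alignment."""
    offset, out = 0, []
    for t in members:
        align = BASE_ALIGNMENT[t]
        offset = (offset + align - 1) // align * align  # round up
        out.append((offset, SIZE[t]))
        offset += SIZE[t]
    return out

# A float followed by a vec4: 12 bytes of padding are lost between them.
print(std140_offsets(["float", "vec4"]))  # [(0, 4), (16, 16)]
```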

Related

Properties of the stride in a glTF file?

The Khronos docs define the stride as:
When a buffer view is used for vertex attribute data, it may have a byteStride property. This property defines the stride in bytes between each vertex.
I am somewhat confused, as many of the examples I have tried already (3 of them) had a stride of 0, so I simply ignored the attribute until now. Those examples render just fine.
I was inferring the "stride" from the type, e.g. if the type was a vec3 and the component type was float, I loaded every 12 bytes as one element. Some things that I am not entirely sure about after reading the spec:
When stride is non-zero, does this mean the data could be interleaved?
When the stride is non-zero, can the data be non-contiguous (e.g. padding bytes)?
In other words, can you run into situations where the buffer is not interleaved but the total size of sizeof(type_component) * element_count is not a divisor of the total subsection of memory to be read?
Yes, accessors (in glTF) are like vertex attributes in OpenGL/WebGL, and are allowed to interleave. The stride is on the bufferView, to force accessors that share that bufferView to all have the same stride. A value of zero means "tightly packed".
Note that you may interleave components of different sizes, such as vec3 (POSITION) with vec2 (TEXCOORD_0), so a stride might be the sum of different sizes.
The Data Interleaving section of the glTF tutorial has a diagram illustrating this. In that example, there are two accessors, one for POSITION and one for NORMAL, sharing a single bufferView.
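As a minimal sketch of how an interleaved bufferView is read (the parameter names mirror glTF's byteOffset/byteStride properties, but this is illustration, not a glTF loader):

```python
import struct

def read_accessor(buffer, byte_offset, byte_stride, fmt, count):
    """Read `count` elements of struct format `fmt`, spaced
    `byte_stride` bytes apart. A stride of 0 means tightly packed."""
    size = struct.calcsize(fmt)
    stride = byte_stride or size
    return [struct.unpack_from(fmt, buffer, byte_offset + i * stride)
            for i in range(count)]

# Two vertices of interleaved POSITION (vec3) + NORMAL (vec3):
# the shared stride is 24 bytes; normals start 12 bytes in.
data = struct.pack("<12f", 0, 0, 0,  0, 0, 1,
                           1, 0, 0,  0, 0, 1)
print(read_accessor(data, 0, 24, "<3f", 2))   # positions
print(read_accessor(data, 12, 24, "<3f", 2))  # normals
```

Note how both accessors share the same stride, as required when they share one bufferView.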

Vulkan: weird performance of uniform buffer

One of the inputs of my fragment shader is an array of 5 structures. The shader computes a color based on each of the 5 structures. In the end, these 5 colors are summed together to produce the final output. The total size of the array is 1440 bytes. To accommodate the alignment of the uniform buffer, the size of the uniform buffer changes to 1920 bytes.
1- If I define the array of 5 structures as a uniform buffer array, the rendering takes 5ms (measured by Nsight Graphics). The uniform buffer's memory property is 'VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT'. The uniform buffer in glsl is defined as follows
layout(set=0,binding=0) uniform UniformStruct { A a; } us[];
layout(location=0) out vec4 c;

void main()
{
    vec4 col = vec4(0);
    for (int i = 0; i < 5; i++)
        col += func(us[nonuniformEXT(i)]);
    c = col;
}
Besides, I'm using the 'GL_EXT_nonuniform_qualifier' extension to access the uniform buffer array. This seems the most straightforward way to me, but there are alternative implementations.
2- I can split the rendering from one vkCmdDraw into five vkCmdDraws, change the framebuffer's blend mode from overwriting to addition, and define a single uniform buffer instead of a uniform buffer array in the fragment shader. On the CPU side, I change the descriptor type from UNIFORM_BUFFER to UNIFORM_BUFFER_DYNAMIC. Before each vkCmdDraw, I bind the dynamic uniform buffer with the corresponding offset. In the fragment shader, the for loop is removed. Although it seems that it should be slower than the first method, it is surprisingly much faster: the rendering only takes 2ms total for 5 draws.
3- If I define the array of 5 structures as a storage buffer and do one vkCmdDraw, the rendering takes only 1.4ms. In other words, if I change the array from a uniform buffer array to a storage buffer but keep everything else the same as in 1, it becomes faster.
4- If I define the array of 5 structures as a global constant in the glsl and do one vkCmdDraw, the rendering takes only 0.5ms.
In my opinion, 4 should be the fastest way, which is true in the test. Then 1 should be next, with both 2 and 3 slower than 1. However, neither 2 nor 3 is slower than 1; on the contrary, they are much faster. Any ideas why using a uniform buffer array slows down the rendering? Is it because it is a host-visible buffer?
When it comes to UBOs, there are two kinds of hardware: the kind where UBOs are specialized hardware and the kind where they aren't. For GPUs where UBOs are not specialized hardware, a UBO is really just a readonly SSBO. You can usually tell the difference because hardware where UBOs are specialized will have different size limits on them from those of SSBOs.
For specialized hardware-based UBOs (which NVIDIA still uses, if I recall correctly), each UBO represents an upload from memory into a big block of constant data that all invocations of a particular shader stage can access.
For this kind of hardware, an array of UBOs is basically creating an array out of segments of this block of constant data. And some hardware has multiple blocks of constant data, so indexing them with non-constant expressions is tricky. This is why non-constant access to such indices is an optional feature of Vulkan.
By contrast, a UBO which contains an array is just one big UBO. It's special only in how big it is. Indexing through an array within a UBO is no different from indexing any array. There are no special rules with regard to the uniformity of the index of such accesses.
So stop using an array of UBOs and just use a single UBO which contains an array of data:
layout(set=0,binding=0) uniform UniformStruct { A a[5]; } us;
It'll also avoid additional padding due to alignment, additional descriptors, additional buffers, etc.
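To see why the single-UBO-with-array form avoids per-element padding, here is a sketch of the std140 array-stride rule (element stride is the element's base alignment rounded up to a vec4, i.e. 16 bytes); the 288-byte element size is inferred from the question's 1440 bytes / 5 elements:

```python
def std140_array_stride(element_size, element_alignment):
    """std140: the array stride is the element size rounded up to a
    multiple of the element's base alignment, itself rounded up to
    16 bytes (the size of a vec4)."""
    align = max(element_alignment, 16)
    return (element_size + align - 1) // align * align

# The question's struct: 1440 bytes / 5 elements = 288 bytes each.
# 288 is already a multiple of 16, so an array inside one UBO adds
# no padding between elements.
print(std140_array_stride(288, 16))  # 288
```

Contrast this with an array of UBOs, where each descriptor's offset must satisfy minUniformBufferOffsetAlignment, which is what inflated the buffer to 1920 bytes.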
However, you might also speed things up by not lying to Vulkan. The expression nonuniformEXT(i) states that the expression i is not dynamically uniform. This is incorrect. Every shader invocation that executes this loop will generate i expressions that have values from 0 to 4. Every dynamic instance of the expression i for any invocation will have the same value at that place in the code as every other.
Therefore i is dynamically uniform, so telling Vulkan that it isn't is not helpful.

Confusion About glVertexAttrib... Functions

After a lot of searching, I still am confused about what the glVertexAttrib... functions (glVertexAttrib1d, glVertexAttrib1f, etc.) do and what their purpose is.
My current understanding from reading this question and the documentation is that their purpose is to somehow set a vertex attribute as constant (i.e. don't use an array buffer). But the documentation also talks about how they interact with "generic vertex attributes" which are defined as follows:
Generic attributes are defined as four-component values that are organized into an array. The first entry of this array is numbered 0, and the size of the array is specified by the implementation-dependent constant GL_MAX_VERTEX_ATTRIBS. Individual elements of this array can be modified with a glVertexAttrib call that specifies the index of the element to be modified and a value for that element.
It says that they are all "four-component values", yet it is entirely possible to have more or fewer components than that in a vertex attribute.
What is this saying exactly? Does this only work for vec4 types? What would be the index of a "generic vertex attribute"? A clear explanation is probably what I really need.
In OpenGL, a vertex is specified as a set of vertex attributes. With the advent of the programmable pipeline, you are responsible for writing your own vertex processing functionality. The vertex shader processes one vertex and gets that specific vertex's attributes as input.
These vertex attributes are called generic vertex attributes, since their meaning is completely defined by you as the application programmer (in contrast to the legacy fixed-function pipeline, where the set of attributes was completely defined by the GL).
The OpenGL spec requires implementors to support at least 16 different vertex attributes. So each vertex attribute can be identified by its index from 0 to 15 (or whatever limit your implementation allows, see glGet(GL_MAX_VERTEX_ATTRIBS,...)).
A vertex attribute is conceptually treated as a four-dimensional vector. When you use a type smaller than vec4 in the shader, the additional components are just ignored. If you supply fewer than 4 components, the missing ones are always filled in from (0, 0, 0, 1), which makes sense both for RGBA color vectors and for homogeneous vertex coordinates.
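A minimal sketch of that fill rule (plain Python, not GL; the function name is made up for illustration):

```python
def expand_to_vec4(components):
    """Components missing from the supplied attribute data default
    to the corresponding component of (0, 0, 0, 1)."""
    defaults = (0.0, 0.0, 0.0, 1.0)
    return tuple(components) + defaults[len(components):]

print(expand_to_vec4((2.0, 3.0)))       # (2.0, 3.0, 0.0, 1.0)
# An RGB color gets alpha = 1; a 3D position gets w = 1.
print(expand_to_vec4((0.5, 0.5, 0.5)))  # (0.5, 0.5, 0.5, 1.0)
```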
Though you can declare vertex attributes of mat types, this will just be mapped to a number of consecutive vertex attribute indices.
The vertex attribute data can come either from a vertex array (nowadays these are required to lie in a Vertex Buffer Object, possibly directly in VRAM; in legacy GL, they could also come from ordinary client address space) or from the current value of that attribute.
You enable fetching from attribute arrays via glEnableVertexAttribArray. If a vertex array for a particular attribute you access in your vertex shader is enabled, the GPU will fetch the i-th element from that array when processing vertex i. For all other attributes you access, you will get the current value for that attribute.
The current value can be set via the glVertexAttrib[1234]* family of GL functions. It cannot be changed during a draw call, so it remains constant for the whole draw call - just like a uniform variable.
One important thing worth noting is that, by default, vertex attributes are always floating-point, and you must declare them as float/vec2/vec3/vec4 in the vertex shader to access them. Setting the current value with, for example, glVertexAttrib4ubv, or using GL_UNSIGNED_BYTE as the type parameter of glVertexAttribPointer, will not change this: the data will be automatically converted to floating-point.
Nowadays, the GL does support two other attribute data types, though: 32-bit integers and 64-bit double-precision floating-point values. You have to declare them as int/ivec*, uint/uvec* or double/dvec* respectively in the shader, and you have to use completely separate functions when setting up the array pointer or current values: glVertexAttribIPointer and glVertexAttribI* for signed/unsigned integers, and glVertexAttribLPointer and glVertexAttribL* for doubles ("long floats").

How exactly does imageAtomicExchange work?

I have a texture of vec4's which is being modified by the compute shader.
Different invocations of the compute shader modify different components of the same vector, and this seems to be causing some concurrency problems, as my current method for doing so is:
vec4 texelData = imageLoad(Texture, texCoords);
//Do operations on individual components of texelData based on the invocation id
imageStore(Texture, texCoords, texelData);
I imagine what happens here is that different invocations are getting the original state of texelData which would be all 0's, writing their bit to it, then storing it which means only the component modified by the last invocation to finish will be present at the end.
So I'm looking into using imageAtomicExchange which should do this atomically, therefore eliminating the concurrency problems, however I cannot get it to work.
The spec says that the arguments are:
The image2D to write to
The vec2 coordinates in the image to write to
A float... which I don't understand.
Would it not be a vec4 as that is what is present at each texel? And if not shouldn't there be another argument or an extra dimension of the coordinate vector to specify which component of the texel to swap?

Glsl Matrix Registers?

I remember reading that a mat4x3 took more registers than a mat3x4, as it has four columns, even though they have the same number of elements. I can't seem to find this anywhere anymore. Does the new spec use the same number of uniform registers for both types of matrices?
Is there any performance hit with transpose as well?
mat3x4 a; // transposed mat4x3
result = transpose(a) * vec4(val, 1); // keeps order
result = vec4(val, 1) * a; // better performance?
Assuming (I think) they do the same thing.
In the latest (4.4) spec document, uniform limits are expressed in components. Talking about MAX_FRAGMENT_UNIFORM_COMPONENTS and equivalent limits for other shader stages that can be queried with glGetIntegerv():
These values represent the numbers of individual floating-point, integer, or boolean values that can be held in uniform variable storage for a shader.
Then later, about matrices:
A matrix uniform will consume no more than 4 × min(r,c) components, where r and c are the number of rows and columns in the matrix.
So for mat3x4 and mat4x3, the minimum dimension is 3 both times, so the number of components consumed will be 12 for both.
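The quoted rule is easy to express as code (a trivial sketch; remember that GLSL's matCxR has C columns and R rows):

```python
def matrix_uniform_components(columns, rows):
    """A matrix uniform consumes no more than 4 * min(rows, columns)
    components (GL 4.4 spec wording quoted above)."""
    return 4 * min(rows, columns)

print(matrix_uniform_components(3, 4))  # mat3x4 -> 12
print(matrix_uniform_components(4, 3))  # mat4x3 -> 12
```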
The older style capacity queries are still there, but the definition looks like they were mainly maintained for backwards compatibility:
The implementation-dependent constants MAX_VERTEX_UNIFORM_VECTORS and MAX_FRAGMENT_UNIFORM_VECTORS have values respectively equal to the values of MAX_VERTEX_UNIFORM_COMPONENTS and MAX_FRAGMENT_UNIFORM_COMPONENTS divided by four.
Note that attributes, unlike uniforms, are still vector oriented. According to table 11.2, a mat3x4 consumes 3 attribute slots, while a mat4x3 consumes 4 attribute slots.
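The attribute case follows a different rule, per the table cited above: one 4-component slot per column, regardless of row count. As a sketch:

```python
def matrix_attribute_slots(columns, rows):
    """A matrix vertex attribute consumes one 4-component slot per
    column; the row count does not matter (each column fills a slot)."""
    return columns

print(matrix_attribute_slots(3, 4))  # mat3x4 -> 3
print(matrix_attribute_slots(4, 3))  # mat4x3 -> 4
```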