What use has the layout specifier scalar in EXT_scalar_block_layout? - c++

Question
What use has the scalar layout specifier when accessing a storage buffer in GL_EXT_scalar_block_layout? (see below for example)
What would be a use case for scalar?
Background
I recently programmed a simple ray tracer using Vulkan and NVIDIA's VkRayTracing extension, following this tutorial. In the section about the closest hit shader it is necessary to access some data that is stored in, well, storage buffers (with usage flags vk::BufferUsageFlagBits::eStorageBuffer).
In the shader the extension GL_EXT_scalar_block_layout is used and those buffers are accessed like this:
layout(binding = 4, set = 1, scalar) buffer Vertices { Vertex v[]; } vertices[];
When I first used this code, the validation layers told me that structs like Vertex had an invalid layout, so I changed them to have each member aligned on 16-byte boundaries:
struct Vertex {
vec4 position;
vec4 normal;
vec4 texCoord;
};
with the corresponding struct in C++:
#pragma pack(push, 1)
struct Vertex {
glm::vec4 position_1unused;
glm::vec4 normal_1unused;
glm::vec4 texCoord_2unused;
};
#pragma pack(pop)
The errors disappeared and I got a working ray tracer. But I still don't understand why the scalar keyword is used here. I found this document about the GL_EXT_scalar_block_layout extension, but I really don't understand it. Probably I'm just not used to GLSL terminology? I can't see any reason why I would have to use this.
Also, I just tried removing scalar and everything still worked without any difference, warnings or errors whatsoever. I would be grateful for any clarification or further resources on this topic.

The std140 and std430 layouts do quite a bit of rounding of the offsets/alignments/sizes of objects. std140 basically makes any non-scalar type aligned to the same alignment as a vec4. std430 relaxes that somewhat, but it still does a lot of rounding up to a vec4's alignment.
scalar layout basically means laying the objects out in accordance with their component scalars. Anything that aggregates components (vectors, matrices, arrays, and structs) does not affect layout. In particular:
All types are sized/aligned only to the highest alignment of the scalar components that they actually use. So a struct containing a single uint is sized/aligned to the same size/alignment as a uint: 4 bytes. Under std140 rules, it would have 16-byte size and alignment.
Note that this layout makes vec3 and similar types actually viable, because C and C++ would then be capable of creating alignment rules that map to those of GLSL.
The array stride of elements in the array is based solely on the size/alignment of the element type, recursively. So an array of uint has an array stride of 4 bytes; under std140 rules, it would have a 16-byte stride.
Alignment and padding only matter for scalars. If you have a struct containing a uint followed by a uvec2, in std140/430, this will require 16 bytes, with 4 bytes of padding after the first uint. Under scalar layout, such a struct only takes 12 bytes (and is aligned to 4 bytes), with the uvec2 being conceptually misaligned. Padding therefore only exists if you have smaller scalars, like a uint16 followed by a uint.
In the specific case you showed, scalar layout was unnecessary since all of the types you used are vec4s.

Related

Properties of the stride in a glTF file?

The Khronos docs define the stride as:
When a buffer view is used for vertex attribute data, it may have a byteStride property. This property defines the stride in bytes between each vertex.
I am somewhat confused, as many of the examples I have tried so far (3 of them) had a stride of 0, so I merely ignored the attribute until now. Those examples render just fine.
I was inferring the stride from the type; e.g. if the type was a vec3 and the component type was float, I loaded every 12 bytes as one element. Some things I am not entirely sure about after reading the spec:
When the stride is non-zero, does this mean the data could be interleaved?
When the stride is non-zero, can the data be non-contiguous (e.g. contain padding bytes)?
In other words, can you run into situations where the buffer is not interleaved, but sizeof(type_component) * element_count is not a divisor of the total subsection of memory to be read?
Yes, accessors (in glTF) are like vertex attributes in OpenGL/WebGL, and are allowed to interleave. The stride is on the bufferView, to force accessors that share that bufferView to all have the same stride. A value of zero means "tightly packed".
Note that you may interleave components of different sizes, such as vec3 (POSITION) with vec2 (TEXCOORD_0), so a stride might be the sum of different sizes.
There's a diagram in the Data Interleaving section of the glTF tutorial illustrating this: two accessors, one for POSITION and one for NORMAL, sharing a single bufferView.

Correct struct layout in GLSL bindless texture handles

I've been trying to use the following code to do a global list of bindless texture handles, sent to the GPU using a UBO.
struct Material
{
sampler2D diff;
sampler2D spec;
sampler2D norm;
};
layout(std140, binding = 2) uniform Materials
{
Material materials[64];
};
However, I think I am filling in the buffer wrong in C++, not taking the correct offsets into account. I can't seem to find anything on how the std140 layout handles sampler2D. How should I be doing this? What offsets do I need to take into account?
There's nothing special about handles in this regard. The standard says:
If the member is a scalar consuming N basic machine units, the base alignment is N.
Samplers are effectively 64-bit integers as far as being "scalars" is concerned. So the base alignment of those members is that of a 64-bit integer: 8 bytes. But that's not really relevant, because in std140, the alignment of a struct is always rounded up to the alignment of a vec4. So that struct will take up 32 bytes.

GLSL array of struct members locations

For example, I have code like this:
uniform struct MyStruct {
mat4 model;
mat4 view;
mat4 projection;
float f1;
vec4 v1;
} myStructs[4];
Can I be sure that location of myStructs[1].projection is location of myStructs[0].projection + 5?
I didn't find the exact information about this on khronos.org, but I found some blurry statement:
struct Thingy
{
vec4 an_array[3];
int foo;
};
layout(location = 2) uniform Thingy some_thingies[6];
Each Thingy takes up 4 uniform locations; the first three going to an_array and the fourth going to foo. Thus, some_thingies takes up 24 uniform locations.
It isn't clear here whether the locations follow one after another. Is this stated more precisely somewhere?
Unless you explicitly specify the location of the uniform variable, the locations of arrays of non-basic types are not strictly defined, relative to the location of any particular member of that array. So you must query the location of every member of every array/struct that you use.
Or just explicitly specify the location with layout(location). That's a much easier option; those are explicitly required to allocate their locations sequentially. And for bonus points, you don't have to query anything.
Your first example uniform struct MyStruct is a raw uniform whose member locations are arbitrary and must be queried:
Uniform locations are unique to a specific program. If you do not explicitly assign a uniform to a location (via the OpenGL 4.3 or ARB_explicit_uniform_location feature mentioned above), then OpenGL will assign them arbitrarily.
Your second example layout(location = 2) uniform Thingy some_thingies[6]; is defining a uniform block to which the following memory layout applies:
Quoting the Memory layout section:
The specific size of basic types used by members of buffer-backed blocks is defined by OpenGL. However, implementations are allowed some latitude when assigning padding between members, as well as reasonable freedom to optimize away unused members. How much freedom implementations are allowed for specific blocks can be changed.
There are four memory layout qualifiers: shared, packed, std140, and std430. Defaults can be set the same as for matrix ordering (eg: layout(packed) buffer; sets all shader storage buffer blocks to use packed). The default is shared.
So it seems that they are sequential in memory, but as t.niese points out: only std140 and std430 provide you with those guarantees (note that std430 can only be used with shader storage blocks, not uniform blocks). Since the default layout is shared some parts of your uniform might have been optimised out or padded differently, depending on your driver.
Use glGetUniformLocation to query each location of the members separately:
Uniform variables that are structures or arrays of structures may be queried by calling glGetUniformLocation for each field within the structure. The array element operator "[]" and the structure field operator "." may be used in name in order to select elements within an array or fields within a structure. The result of using these operators is not allowed to be another structure, an array of structures, or a subcomponent of a vector or a matrix. Except if the last part of name indicates a uniform variable array, the location of the first element of an array can be retrieved by using the name of the array, or by using the name appended by "[0]".

GLSL : uniform buffer object example

I have an array of GLubyte of variable size. I want to pass it to fragment shader. I have seen
This thread and this thread. So I decided to use "Uniform Buffer Objects". But being a newbie in GLSL, I do not know:
1 - If I am going to add this to fragment shader, how do I pass size? Should I create a struct?
layout(std140) uniform MyArray
{
GLubyte myDataArray[size]; //I know GLSL doesn't understand GLubyte
};
2- how and where in C++ code associate this buffer object ?
3 - how to deal with casting GLubyte to float?
1 - If I am going to add this to fragment shader, how do I pass size? Should I create a struct?
Using Uniform Buffers (UB), you cannot do this.
size must be static and known when you link your GLSL program. This means it has to be hard-coded into the actual shader.
The modern way around this is to use a feature from GL4 called Shader Storage Buffers (SSB).
SSBs can have variable length (the last field can be declared as an unsized array, like myDataArray[]) and they can also store much more data than UBs.
In older versions of GL, you can use a Buffer Texture to pass large amounts of dynamically sized data into a shader, but that is a cheap hack compared to SSBs and you cannot access the data using a nice struct-like interface either.
3 - how to deal with casting GLubyte to float?
You really would not do this at all; it is considerably more complicated than that.
The smallest data type you can use in a GLSL data structure is 32-bit. You can pack and unpack smaller pieces of data into a uint if needed, though, using special functions like packUnorm4x8 (...). This was done intentionally, to avoid having to define new data types with smaller sizes.
You can do that even without using any special GLSL functions.
packUnorm4x8 (...) is roughly equivalent to performing the following:
for (int i = 0; i < 4; i++)
packed += round (clamp (vec [i], 0, 1) * 255.0) * pow (2, i * 8);
It takes a 4-component vector of floating-point values in the range [0,1] and does fixed-point arithmetic to pack each of them into an unsigned normalized (unorm) 8-bit integer occupying its own 1/4 of a uint.
Newer versions of GLSL introduce intrinsic functions that do that, but GPUs have actually been doing that sort of thing for as long as shaders have been around. Anytime you read/write a GL_RGBA8 texture from a shader you are basically packing or unpacking 4 8-bit unorms represented by a 32-bit integer.

OpenGL Uniform Buffer std140 layout

I’m trying to pass an array of ints to the fragment shader via a uniform block (everything according to GLSL “#version 330”) on a GeForce 8600 GT.
On the side of the app I have:
int MyArray[7102];
…
//filling, binding, etc
…
glBufferData(GL_UNIFORM_BUFFER, sizeof(MyArray), MyArray, GL_DYNAMIC_DRAW);
In my fragment shader I declare according block as follows:
layout (std140) uniform myblock
{
int myarray[7102];
};
The problem is that after a successful glCompileShader, glLinkProgram returns an error saying that it can’t bind an appropriate storage resource.
Few additional facts:
1) GL_MAX_UNIFORM_BLOCK_SIZE returned value 65536
2) If I lower the number of elements to 4096, it works fine, and it makes no difference whether I use “int” or “ivec4” as the array type. Anything above 4096 gives me the same “storage error”
3) If I use “shared” or “packed”, everything works as expected
After consulting with GLSL 3.3 specification for std140, I’m assuming that there is a problem with aligning/padding according to:
“1) If the member is a scalar consuming N basic machine units, the base alignment is N.
...
4) If the member is an array of scalars or vectors, the base alignment and array stride are set to match the base alignment of a single array element, according to rules (1), (2), and (3), and rounded up to the base alignment of a vec4. The array may have padding at the end; the base offset of the member following the array is rounded up to the next multiple of the base alignment.”
My questions:
1) Is it true that “myblock” occupies 4 times more memory than the expected 7102*4=28408 bytes? I.e., std140 expands each element of myarray to a vec4, and the real memory usage is 7102*4*4=113632 bytes, which is the cause of the problem?
2) The reason it works with “shared” or “packed” is due to the elimination of these gaps because of optimization?
3) Maybe it’s a driver bug? All facts point to the “…and rounded up to the base alignment of a vec4” being the reason, but it’s quite hard to accept that something as simple as array of ints ends up being 4 times less effective in terms of memory constraints.
4) If it’s not a bug, then how should I organize and access an array in case of std140? I can use “ivec4” for optimal data distribution but then instead of simple x=myarray[i] I have to sacrifice performance doing something like x=myarray[i/4][i%4] to refer to individual elements of each ivec4? Or am I missing something and there is obvious solution?
1) (…) rounded up to the base alignment of a vec4? (…)
Yes.
2) The reason it works with “shared” or “packed” is due to the elimination of these gaps because of optimization?
Yes; note, though, that this is not an optimization in the performance sense, it merely packs the data tighter.
3) Maybe it’s a driver bug?
No. GPUs naturally work with vectorized types. Packing the types requires adding further instructions to de-/multiplex the vectors. EDIT: In the time since writing this answer, significant changes to GPU architectures have happened. GPUs made these days are all single-scalar architectures, with the design emphasis on strong superscalar vectorization.
4) If it’s not a bug, then how should I organize and access an array in case of std140?
Don't use uniform buffer objects for such large data. Put the data into a 1D texture and use texelFetch to index into it.