I’m trying to pass an array of ints to the fragment shader via a uniform block (everything per GLSL “#version 330”) on a GeForce 8600 GT.
On the application side I have:
int MyArray[7102];
…
//filling, binding, etc
…
glBufferData(GL_UNIFORM_BUFFER, sizeof(MyArray), MyArray, GL_DYNAMIC_DRAW);
In my fragment shader I declare the corresponding block as follows:
layout (std140) uniform myblock
{
int myarray[7102];
};
The problem is that glCompileShader succeeds, but glLinkProgram then returns an error saying that it can’t bind an appropriate storage resource.
A few additional facts:
1) GL_MAX_UNIFORM_BLOCK_SIZE returned value 65536
2) If I lower the number of elements to 4096, it works fine, and it makes no difference whether I use “int” or “ivec4” as the array type. Anything above 4096 gives me the same “storage error”.
3) If I use “shared” or “packed”, everything works as expected.
After consulting the GLSL 3.3 specification on std140, I’m assuming there is an alignment/padding problem, per:
“1) If the member is a scalar consuming N basic machine units, the base alignment
is N.
...
4) If the member is an array of scalars or vectors, the base alignment and array
stride are set to match the base alignment of a single array element, according
to rules (1), (2), and (3), and rounded up to the base alignment of a vec4. The
array may have padding at the end; the base offset of the member following
the array is rounded up to the next multiple of the base alignment.”
My questions:
1) Is it true that “myblock” occupies four times more than just 7102*4 = 28408 bytes? I.e., std140 expands each element of myarray to the size of a vec4, so the real memory usage is 7102*4*4 = 113632 bytes, which is the cause of the problem?
2) The reason it works with “shared” or “packed” is due to the elimination of these gaps because of optimization?
3) Maybe it’s a driver bug? All facts point to “…and rounded up to the base alignment of a vec4” being the reason, but it’s quite hard to accept that something as simple as an array of ints ends up being four times less memory-efficient.
4) If it’s not a bug, how should I organize and access an array under std140? I can use “ivec4” for optimal data density, but then instead of a simple x = myarray[i] I have to sacrifice performance with something like x = myarray[i/4][i%4] to reach the individual elements of each ivec4. Or am I missing something, and there is an obvious solution?
1) (…) rounded up to the base alignment of a vec4? (…)
Yes.
2) The reason it works with “shared” or “packed” is due to the elimination of these gaps because of optimization?
Yes; though note that this is not a performance optimization.
3) Maybe it’s a driver bug?
EDIT: No. GPUs naturally work with vectorized types; packing the types requires extra instructions to de-/multiplex the vectors. In the time since this answer was written, GPU architectures have changed significantly: GPUs made these days are all single-scalar architectures, with the design emphasis on strong superscalar vectorization.
4) If it’s not a bug, then how should I organize and access an array in case of std140?
Don't use uniform buffer objects for such large data. Put the data into a 1D texture and use texelFetch to index into it.
Related
One of the inputs of my fragment shader is an array of 5 structures. The shader computes a color based on each of the 5 structures. In the end, these 5 colors are summed together to produce the final output. The total size of the array is 1440 bytes. To accommodate the alignment of the uniform buffer, the size of the uniform buffer changes to 1920 bytes.
1- If I define the array of 5 structures as a uniform buffer array, the rendering takes 5ms (measured by Nsight Graphics). The uniform buffer's memory property is 'VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT'. The uniform buffer in GLSL is defined as follows:
layout(set=0,binding=0) uniform UniformStruct { A a; } us[];
layout(location=0) out vec4 c;
void main()
{
vec4 col = vec4(0);
for (int i = 0; i < 5; i++)
col += func(us[nonuniformEXT(i)]);
c = col;
}
Besides, I'm using the 'GL_EXT_nonuniform_qualifier' extension to access the uniform buffer array. This seems the most straightforward way to me, but there are alternative implementations.
2- I can split the rendering from one vkCmdDraw into five vkCmdDraws, change the framebuffer's blend mode from overwrite to additive, and define a single uniform buffer instead of a uniform buffer array in the fragment shader. On the CPU side, I change the descriptor type from UNIFORM_BUFFER to UNIFORM_BUFFER_DYNAMIC. Before each vkCmdDraw, I bind the dynamic uniform buffer with the corresponding offset. In the fragment shader, the for loop is removed. Although it seems this should be slower than the first method, it is surprisingly much faster: the rendering takes only 2ms total for the 5 draws.
3- If I define the array of 5 structures as a storage buffer and do one vkCmdDraw, the rendering takes only 1.4ms. In other words, if I change the array from a uniform buffer array to a storage buffer but keep everything else the same as in 1, it becomes faster.
4- If I define the array of 5 structures as a global constant in the glsl and do one vkCmdDraw, the rendering takes only 0.5ms.
In my opinion, 4 should be the fastest way, which the test confirms. Then 1 should be next, with both 2 and 3 slower than 1. However, neither 2 nor 3 is slower than 1; on the contrary, they are much faster. Any ideas why using a uniform buffer array slows down the rendering? Is it because it is a host-visible buffer?
When it comes to UBOs, there are two kinds of hardware: the kind where UBOs are specialized hardware and the kind where they aren't. For GPUs where UBOs are not specialized hardware, a UBO is really just a readonly SSBO. You can usually tell the difference because hardware where UBOs are specialized will have different size limits on them from those of SSBOs.
For specialized hardware-based UBOs (which NVIDIA still uses, if I recall correctly), each UBO represents an upload from memory into a big block of constant data that all invocations of a particular shader stage can access.
For this kind of hardware, an array of UBOs basically creates an array out of segments of this block of constant data. And some hardware has multiple blocks of constant data, so indexing them with non-constant expressions is tricky. This is why non-constant access to such indices is an optional feature of Vulkan.
By contrast, a UBO which contains an array is just one big UBO. It's special only in how big it is. Indexing through an array within a UBO is no different from indexing any array. There are no special rules with regard to the uniformity of the index of such accesses.
So stop using an array of UBOs and just use a single UBO which contains an array of data:
layout(set=0,binding=0) uniform UniformStruct { A a[5]; } us;
It'll also avoid additional padding due to alignment, additional descriptors, additional buffers, etc.
However, you might also speed things up by not lying to Vulkan. The expression nonuniformEXT(i) states that the expression i is not dynamically uniform. This is incorrect. Every shader invocation that executes this loop will generate i expressions that have values from 0 to 4. Every dynamic instance of the expression i for any invocation will have the same value at that place in the code as every other.
Therefore i is dynamically uniform, so telling Vulkan that it isn't is not helpful.
Question
What use has the scalar layout specifier when accessing a storage buffer in GL_EXT_scalar_block_layout? (see below for example)
What would be use case for scalar?
Background
I recently programmed a simple raytracer using Vulkan and NVIDIA's VkRayTracing extension, following this tutorial. In the section about the closest-hit shader, it is necessary to access some data that's stored in, well, storage buffers (with usage flags vk::BufferUsageFlagBits::eStorageBuffer).
In the shader the extension GL_EXT_scalar_block_layout is used and those buffers are accessed like this:
layout(binding = 4, set = 1, scalar) buffer Vertices { Vertex v[]; } vertices[];
When I first used this code, the validation layers told me that structs like Vertex had an invalid layout, so I changed them to have each member aligned on 16-byte boundaries:
struct Vertex {
vec4 position;
vec4 normal;
vec4 texCoord;
};
with the corresponding struct in C++:
#pragma pack(push, 1)
struct Vertex {
glm::vec4 position_1unused;
glm::vec4 normal_1unused;
glm::vec4 texCoord_2unused;
};
#pragma pack(pop)
The errors disappeared and I got a working raytracer. But I still don't understand why the scalar keyword is used here. I found this document about the GL_EXT_scalar_block_layout extension, but I really don't understand it. Probably I'm just not used to GLSL terminology? I can't see any reason why I would have to use this.
Also, I just tried removing scalar, and it still worked with no difference, warnings, or errors whatsoever. I would be grateful for any clarification or further resources on this topic.
The std140 and std430 layouts do quite a bit of rounding of the offsets, alignments, and sizes of objects. std140 basically makes any non-scalar type aligned to the same alignment as a vec4. std430 relaxes that somewhat, but it still does a lot of rounding up to a vec4's alignment.
scalar layout means basically to layout the objects in accord with their component scalars. Anything that aggregates components (vectors, matrices, arrays, and structs) does not affect layout. In particular:
All types are sized/aligned only to the highest alignment of the scalar components that they actually use. So a struct containing a single uint is sized/aligned to the same size/alignment as a uint: 4 bytes. Under std140 rules, it would have 16-byte size and alignment.
Note that this layout makes vec3 and similar types actually viable, because C and C++ would then be capable of creating alignment rules that map to those of GLSL.
The array stride of elements in the array is based solely on the size/alignment of the element type, recursively. So an array of uint has an array stride of 4 bytes; under std140 rules, it would have a 16-byte stride.
Alignment and padding only matter for scalars. If you have a struct containing a uint followed by a uvec2, in std140/430, this will require 16 bytes, with 4 bytes of padding after the first uint. Under scalar layout, such a struct only takes 12 bytes (and is aligned to 4 bytes), with the uvec2 being conceptually misaligned. Padding therefore only exists if you have smaller scalars, like a uint16 followed by a uint.
In the specific case you showed, scalar layout was unnecessary since all of the types you used are vec4s.
I know that I can glGet with GL_MAX_VARYING_VECTORS to get the number of 4-component vectors I can use as varyings. What about other size vectors, matrices, and arrays of said types? How closely can these be packed; what are the restrictions?
I came across this question with a good answer. However, I am interested in desktop OpenGL, not OpenGL ES. Is there a difference in this case? I briefly searched this spec but found nothing useful.
Besides a general interest in how varyings are packed, I have a shader program whose sole varying is an array of vec2s. I want to programmatically get the largest size that this array can be. How can I derive that from GL_MAX_VARYING_VECTORS?
On my computer, GL_MAX_VARYING_VECTORS is 16, and I can size the array up to 62 before it won't compile.
Additionally, should I even be using varyings? I'm aware that newer versions of GLSL use an in/out syntax; should I switch, and do you know of any resources to get me started?
For all versions of Desktop OpenGL (where it matters), the limitations on the interface between shader stages are defined by the number of "components", not the number of "vectors". GL_MAX_VERTEX_OUTPUT_COMPONENTS, for example, defines the maximum number of output components the VS can generate.
A "component" is a float, integer, or boolean value. A vec3 takes up 3 components. A vec2[4] takes up 8 components.
So:
I want to programmatically get the largest size that this array can be. How can I derive that from GL_MAX_VARYING_VECTORS?
You don't. You derive it from the actual component count. In modern desktop GL 3.2+, this is defined by a per-stage value. For vertex shaders, this is GL_MAX_VERTEX_OUTPUT_COMPONENTS. For geometry shaders, this is GL_MAX_GEOMETRY_OUTPUT_COMPONENTS.
If the number of components is 64 (the minimum value that GL 4.x implementations will return), then you can have 32 vec2s.
According to the standard, at least. In practice, implementations have a tendency to vary with how these things work. Particularly in the early days, you were pretty much guaranteed that implementations would take each individual element of an array and expand it into a vec4.
Do they do that now? Well, either they do or they don't. If you manually pack your array of vec2s into an array of vec4s, then you're far more likely to work across platforms. So if you're not willing/able to test on numerous implementations, that's what I would do.
And yes, if you're using modern OpenGL implementations, you should be using in/out syntax. But that won't change anything in this regard; it's just syntax.
I have an array of GLubyte of variable size. I want to pass it to a fragment shader. I have seen
This thread and this thread. So I decided to use "Uniform Buffer Objects". But being a newbie in GLSL, I do not know:
1 - If I am going to add this to the fragment shader, how do I pass the size? Should I create a struct?
layout(std140) uniform MyArray
{
GLubyte myDataArray[size]; //I know GLSL doesn't understand GLubyte
};
2 - How and where in the C++ code do I associate this buffer object?
3 - How do I deal with casting GLubyte to float?
1 - If I am going to add this to fragment shader, how do I pass size? Should I create a struct?
Using Uniform Buffers (UB), you cannot do this.
size must be static and known when you link your GLSL program. This means it has to be hard-coded into the actual shader.
The modern way around this is to use a feature from GL4 called Shader Storage Buffers (SSB).
SSBs can have variable length (the last field can be declared as an unsized array, like myDataArray[]) and they can also store much more data than UBs.
In older versions of GL, you can use a Buffer Texture to pass large amounts of dynamically sized data into a shader, but that is a cheap hack compared to SSBs and you cannot access the data using a nice struct-like interface either.
3 - how to deal with casting GLubyte to float?
You really would not do this at all; it is considerably more complicated than a simple cast.
The smallest data type you can use in a GLSL data structure is 32-bit. You can, however, pack and unpack smaller pieces of data into a uint using special functions like packUnorm4x8 (...). This was done intentionally, to avoid having to define new data types with smaller sizes.
You can do that even without using any special GLSL functions.
packUnorm4x8 (...) is roughly equivalent to performing the following:
float packed = 0.0;
for (int i = 0; i < 4; i++)
    packed += round(clamp(vec[i], 0.0, 1.0) * 255.0) * pow(2.0, float(i * 8));
It takes a 4-component vector of floating-point values in the range [0,1] and does fixed-point arithmetic to pack each of them into an unsigned normalized (unorm) 8-bit integer occupying its own 1/4 of a uint.
Newer versions of GLSL introduce intrinsic functions that do that, but GPUs have actually been doing that sort of thing for as long as shaders have been around. Anytime you read/write a GL_RGBA8 texture from a shader you are basically packing or unpacking 4 8-bit unorms represented by a 32-bit integer.
Running OpenGL 3.1; the question is simple.
From the GLSL site, here is how one can define an array of uniform blocks:
uniform BlockName
{
vec3 blockMember1, blockMember2;
float blockMember3;
} multiBlocks[3];
Now, is it possible to have a dynamic number of these multiBlocks? There are no pointers in GLSL, so there is no "new" statement, etc.
If not, is there another approach to sending a dynamic number of elements?
My block currently packs four floats and one vec2.
I haven't written the shader yet, so you can suggest anything. Thanks ;)
You can't have a dynamic number of them, and you can't have a dynamic index into them. That means that even if you could change the count dynamically, it would be of little use since you'd still have to change the shader code to access the new elements.
One possible alternative would be to make the block members arrays:
#define BLOCK_COUNT %d
uniform BlockName
{
vec3 blockMember1[BLOCK_COUNT];
vec3 blockMember2[BLOCK_COUNT];
float blockMember3[BLOCK_COUNT];
} multiBlocks;
Then you can alter BLOCK_COUNT to change the number of members, and you can use dynamic indexes just fine:
multiBlocks.blockMember2[i];
It still doesn't allow you to alter the number of elements without recompiling the shader, though.
OK, so I also wrote to the OpenGL forum, and this came out.
So basically you have 3 solutions:
Uniform buffer objects, texture buffers, or a static array with some high number of preallocated elements plus another uniform specifying the actual size.
The last one could be improved with OneSadCookie's compile-time definition of the max size.