Vulkan: weird performance of uniform buffer - glsl

One of the inputs to my fragment shader is an array of 5 structures. The shader computes a color from each of the 5 structures, and at the end these 5 colors are summed to produce the final output. The total size of the array is 1440 bytes. To satisfy the uniform buffer offset alignment, the buffer size grows to 1920 bytes.
1- If I define the array of 5 structures as a uniform buffer array, the rendering takes 5 ms (measured with Nsight Graphics). The uniform buffer's memory properties are VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. The uniform buffer in GLSL is defined as follows:
#extension GL_EXT_nonuniform_qualifier : enable

layout(set = 0, binding = 0) uniform UniformStruct { A a; } us[];
layout(location = 0) out vec4 c;

void main()
{
    vec4 col = vec4(0);
    for (int i = 0; i < 5; i++)
        col += func(us[nonuniformEXT(i)].a);
    c = col;
}
Additionally, I'm using the 'GL_EXT_nonuniform_qualifier' extension to index into the uniform buffer array. This seems like the most straightforward approach to me, but there are alternative implementations.
2- I can split the rendering from one vkCmdDraw into five vkCmdDraws, change the framebuffer's blend mode from overwrite to additive, and define a single uniform buffer instead of a uniform buffer array in the fragment shader. On the CPU side, I change the descriptor type from UNIFORM_BUFFER to UNIFORM_BUFFER_DYNAMIC. Before each vkCmdDraw, I bind the dynamic uniform buffer with the corresponding offset, and the for loop is removed from the fragment shader. Although it seems like this should be slower than the first method, it is surprisingly much faster: the rendering takes only 2 ms in total for the 5 draws. (The GLSL declarations for this and the next two variants are sketched after the list.)
3- If I define the array of 5 structures as a storage buffer and do one vkCmdDraw, the rendering takes only 1.4 ms. In other words, if I change the array from a uniform buffer array to a storage buffer but keep everything else the same as in 1, it becomes faster.
4- If I define the array of 5 structures as a global constant in the GLSL code and do one vkCmdDraw, the rendering takes only 0.5 ms.
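For reference, rough sketches of the GLSL declarations used in variants 2-4 might look like this (the block and instance names below are placeholders; struct A is the one from variant 1, which is not shown here):

// Variant 2: one plain uniform buffer, rebound with a new dynamic offset per draw.
layout(set = 0, binding = 0) uniform UniformStruct { A a; } us;

// Variant 3: the whole array in a single read-only storage buffer.
layout(set = 0, binding = 0) readonly buffer StructArray { A a[5]; } sb;

// Variant 4: the array baked into the shader as a compile-time constant
// (initializer omitted, since it depends on the members of struct A).
// const A ca[5] = A[5](...);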
In my opinion, 4 should be the fastest way, which the test confirms. I expected 1 to be next, with both 2 and 3 slower than 1. However, neither 2 nor 3 is slower than 1; on the contrary, both are much faster. Any ideas why using a uniform buffer array slows down the rendering? Is it because it is a host-visible buffer?

When it comes to UBOs, there are two kinds of hardware: the kind where UBOs are specialized hardware and the kind where they aren't. For GPUs where UBOs are not specialized hardware, a UBO is really just a readonly SSBO. You can usually tell the difference because hardware where UBOs are specialized will have different size limits on them from those of SSBOs.
For specialized hardware-based UBOs (which NVIDIA still uses, if I recall correctly), each UBO represents an upload from memory into a big block of constant data that all invocations of a particular shader stage can access.
For this kind of hardware, an array of UBOs is basically creating an array out of segments of this block of constant data. And some hardware has multiple blocks of constant data, so indexing them with non-constant expressions is tricky. This is why non-constant access to such indices is an optional feature of Vulkan.
By contrast, a UBO which contains an array is just one big UBO. It's special only in how big it is. Indexing through an array within a UBO is no different from indexing any array. There are no special rules with regard to the uniformity of the index of such accesses.
So stop using an array of UBOs and just use a single UBO which contains an array of data:
layout(set=0,binding=0) uniform UniformStruct { A a[5]; } us;
It'll also avoid additional padding due to alignment, additional descriptors, additional buffers, etc.
However, you might also speed things up by not lying to Vulkan. The expression nonuniformEXT(i) states that the expression i is not dynamically uniform. This is incorrect. Every shader invocation that executes this loop will generate i expressions that have values from 0 to 4. Every dynamic instance of the expression i for any invocation will have the same value at that place in the code as every other.
Therefore i is dynamically uniform, so telling Vulkan that it isn't is not helpful.
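Put together, a minimal sketch of the fragment shader rewritten this way might look as follows (struct A and func are assumed to be the ones from the question):

layout(set = 0, binding = 0) uniform UniformStruct { A a[5]; } us;
layout(location = 0) out vec4 c;

void main()
{
    // Plain, dynamically uniform loop index; no nonuniformEXT needed.
    vec4 col = vec4(0);
    for (int i = 0; i < 5; i++)
        col += func(us.a[i]);
    c = col;
}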

Related

Should Uniform Buffer Objects be used over glUniform when uniform block size is small?

I've recently replaced all usage of glUniform() in my code to instead use Uniform Buffer Objects (UBOs) to better organize my object rendering code. However, my performance has sharply plummeted as the number of objects drawn increases (an object in my case is anything with its own offset into a UBO; only one UBO is shared between everything that uses the same shader program).
I think this may be because my objects actually use very little uniform data, sometimes as little as one vec4, while a UBO requires each offset to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which can be rather large.
I understood UBOs to be a strict upgrade over glUniform(); am I mistaken? Should I keep a glUniform() path for these simple objects, or is there a necessary step to using UBOs effectively with many different simple objects that I haven't taken?
My scene is mostly quads, with very few objects using more than 4 verts, 6 indices. Some of these quads have large uniform blocks while most have very simple ones, maybe one or two vec4.
(For now, I've gotten around this problem by embedding this object data into the vertices, though it definitely feels wrong to do it this way.)

Shader storage buffer object slow – alternative?

I'm trying to make an array of vec3 available to a fragment shader. In the targeted application, there could be several hundred elements.
I tested transferring data in the form of a shader storage buffer object, declared as
layout(binding = 0) buffer voxels { vec3 xyz[]; };
and set using glBufferData, but I found that my fragment shader becomes very slow, even with only 33 elements.
Moreover, when I convert the same data into the GLSL code of a const vec3[] and include it in the shader code, the shader becomes noticeably faster.
Is there a better way – faster than an SSBO and more elegant than creating shader code?
As might already be apparent from the above, the array is only read from in the shader. It is constant within the shader as well as over shader invocations for different fragments, so effectively a uniform, and it is set only once or a few times over the runtime of the program.
I'd recommend using the std430 layout specifier on the SSBO given that you are using vec3 data types; otherwise you'll be forced to pad the data, which isn't going to be great. In general, if the buffer is a fixed size, prefer glBufferSubData over glBufferData (the latter may reallocate memory on the GPU).
As yet another alternative, if you are able to target GL 4.4+, consider using glBufferStorage instead (or even better, if GL 4.5 is available, use glCreateBuffers and glNamedBufferStorage). This lets you pass a few more hints to the GL driver about the way in which the buffer will be consumed. I'd try out a few options (e.g. mapping vs. sub-data vs. recreating each time).
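For instance, the declaration from the question with the layout qualifier spelled out might read:

// The question's SSBO, with the std430 layout the answer recommends.
layout(std430, binding = 0) buffer voxels { vec3 xyz[]; };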

Can interface block size be larger than the underlying UBO in OpenGL?

Let's declare a large interface block in a shader:
struct InstancingData
{
    // whatever
};

#define LARGE_SIZE 1048576

layout(std140, row_major, binding = 0) uniform InstanceBlock
{
    InstancingData data[LARGE_SIZE];
};
Then I want to bind a small UBO, containing fewer than LARGE_SIZE entries of InstancingData, to this block. It may be either glBindBufferBase of a small buffer, or glBindBufferRange of a small range within a larger buffer.
Consequently, I will index data only with indices smaller than the underlying buffer size allows, using an appropriate uniform or gl_VertexID. So formally it shouldn't lead to an access violation.
Will these actions trigger an error or undefined behavior in any OpenGL version?
Another way to go:
I declared
InstancingData data[1];
Then I bound a buffer of 42 structures and indexed all of them (a 6x7 grid of instanced models), and it worked fine on my machine. Is it guaranteed to work everywhere?
From ARB_uniform_buffer_object:
If any active uniform block is not backed by a sufficiently large
buffer object, the results of shader execution are undefined, and may
result in GL interruption or termination.
This is sufficient to say that backing InstancingData data[LARGE_SIZE]; with a small buffer is illegal.

(OpenGL 3.1 - 4.2) Dynamic Uniform Arrays?

Let's say I have 2 species, such as humans and ponies. They have different skeletal systems, so the uniform bone array will have to be different for each species. Do I have to implement two separate shader programs, each able to render its bone array properly, or is there a way to dynamically declare uniform arrays and iterate through that dynamic array instead?
Keep performance in mind (there's the usual claim going around that shaders are bad at decision branching).
Until OpenGL 4.3, arrays in GLSL had to be of a fixed, compile-time size. 4.3 allows the use of shader storage buffer objects, which allow for their ultimate length to be "unbounded". Basically, you can do this:
buffer BlockName
{
    mat4 manyManyMatrices[];
};
OpenGL will figure out how many matrices are in this array at runtime based on how you use glBindBufferRange. So you can still use manyManyMatrices.length() to get the length, but it won't be a compile-time constant.
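As a small sketch (inside main(), assuming the block declared above), the loop bound then comes from the buffer range bound at draw time:

// length() reflects the size of the range bound with glBindBufferRange,
// so the loop bound is only known at run time.
for (int i = 0; i < manyManyMatrices.length(); ++i)
{
    // ... use manyManyMatrices[i]
}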
However, this feature is (at the time of this edit) very new and only implemented in beta. It also requires GL 4.x-class hardware (aka: Direct3D 11-class hardware). Lastly, since it uses shader storage blocks, accessing the data may be slower than one might hope for.
As such, I would suggest that you just use a uniform block with the largest number of matrices that you would use. If that becomes a memory issue (unlikely), then you can split your shaders based on array size or use shader storage blocks or whatever.
You can use n-by-1 textures as a replacement for arrays. The texture size can be specified at run time. I use this approach for passing an arbitrary number of lights to my shaders, and I'm surprised how fast it runs despite the many loops and branches. For an example, see the polygon.f shader file in jogl3.glsl.nontransp in the jReality sources.
uniform sampler2D sys_globalLights;
uniform int sys_numGlobalDirLights;
uniform int sys_numGlobalPointLights;
uniform int sys_numGlobalSpotLights;
...
int lightTexSize = sys_numGlobalDirLights*3 + sys_numGlobalPointLights*3 + sys_numGlobalSpotLights*5;
for (int i = 0; i < sys_numGlobalDirLights; i++) {
    vec4 dir = texture(sys_globalLights, vec2((3*i + 1 + 0.5)/lightTexSize, 0));
    ...

Dynamic number of uniform blocks

Running OpenGL 3.1; the question is simple.
From the GLSL documentation, here is how one can define an array of uniform buffer blocks:
uniform BlockName
{
    vec3 blockMember1, blockMember2;
    float blockMember3;
} multiBlocks[3];
Now, is it possible to have a dynamic number of these multiBlocks? There are no pointers in GLSL, so there is no "new" statement, etc.
If not, is there other approach to send dynamic number of elements?
My block is currently packing four floats and one vec2.
I haven't written the shader yet, so you can suggest anything. Thanks ;)
You can't have a dynamic number of them, and you can't have a dynamic index into them. That means that even if you could change the count dynamically, it would be of little use since you'd still have to change the shader code to access the new elements.
One possible alternative would be to make the block members arrays:
#define BLOCK_COUNT %d
uniform BlockName
{
    vec3 blockMember1[BLOCK_COUNT];
    vec3 blockMember2[BLOCK_COUNT];
    float blockMember3[BLOCK_COUNT];
} multiBlocks;
Then you can alter BLOCK_COUNT to change the number of members, and you can use dynamic indexes just fine:
multiBlocks.blockMember2[i];
It still doesn't allow you to alter the number of elements without recompiling the shader, though.
OK, so I also posted on the OpenGL forum, and this is what came out of it.
So basically you have three solutions:
uniform buffer objects, texture buffers, or a static array with some high maximum number of prepared elements plus another uniform specifying the actual size.
The last one can be combined with OneSadCookie's suggestion of defining the maximum size at compile time.
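A sketch of that last option, reusing the member names from the earlier answer (the maximum of 64 and the name u_blockCount are arbitrary placeholders):

// Fix an upper bound at compile time and pass the real count in a separate uniform.
#define MAX_BLOCKS 64

uniform int u_blockCount;   // number of entries actually filled in by the application

uniform BlockName
{
    vec3 blockMember1[MAX_BLOCKS];
    vec3 blockMember2[MAX_BLOCKS];
    float blockMember3[MAX_BLOCKS];
} multiBlocks;

// In the shader body, loop only over the valid entries:
// for (int i = 0; i < u_blockCount; ++i) { ... multiBlocks.blockMember1[i] ... }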