OpenGL TCS — limits on size of per-patch array

I'm somewhat confused about using arrays as per-patch variables in the TCS (and TES) of the OpenGL pipeline. The basic TCS example below works, but as soon as I increase the size of anotherTest to something above 32, I get
0(5) : error C5041: cannot locate suitable resource to bind variable "anotherTest". Possibly large array.
Using glGetIntegerv(GL_MAX_TESS_PATCH_COMPONENTS, &maxPatchComponents) in my C++ code (in a Qt 5.10 framework on Linux) I get
:: Using OpenGL 4.3.0 NVIDIA 396.24
→ maxPatchComponents = 120
Therefore, I'd say that I should be able to use even anotherTest[120] (note that gl_TessLevelOuter and gl_TessLevelInner do not count towards the number of patch components). So, what is going on? I know that 32 is the common limit for GL_MAX_PATCH_VERTICES (also on my machine), but that should not interfere with the size of my per-patch variable. Thoughts?
#version 430
// Tessellation control shader
layout (vertices = 4) out;

patch out int anotherTest[32];

void main(void) {
    if (gl_InvocationID == 0) {
        gl_TessLevelOuter[0] = 1;
        gl_TessLevelOuter[1] = 64;
    }
    anotherTest[0] = 2;
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;
}
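One plausible workaround (this is a guess about the driver's resource allocation, not spec-mandated behavior): some drivers appear to give every scalar array element a full vec4-sized slot, so packing four ints per ivec4 reduces the number of elements the driver has to bind.

```glsl
// Hypothetical workaround: pack four ints per ivec4 so the per-patch array
// has fewer elements. If the driver gives every array element a full
// vec4-sized slot, ivec4[8] (32 ints) needs only 8 slots instead of 32.
layout (vertices = 4) out;

patch out ivec4 anotherTestPacked[8];   // holds 32 ints in 8 elements

void main(void) {
    // write int k as: anotherTestPacked[k >> 2][k & 3] = value;
    anotherTestPacked[0][0] = 2;
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;
}
```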

Related

When do I need GL_EXT_nonuniform_qualifier?

I want to compile the following code into SPIR-V
#version 450 core

#define BATCH_ID (PushConstants.Indices.x >> 16)
#define MATERIAL_ID (PushConstants.Indices.x & 0xFFFF)

layout (push_constant) uniform constants {
    ivec2 Indices;
} PushConstants;

layout (constant_id = 1) const int MATERIAL_SIZE = 32;

in Vertex_Fragment {
    layout(location = 0) vec4 VertexColor;
    layout(location = 1) vec2 TexCoord;
} inData;

struct ParameterFrequence_3 {
    int ColorMap;
};

layout (set = 3, binding = 0, std140) uniform ParameterFrequence_3 {
    ParameterFrequence_3[MATERIAL_SIZE] data;
} Frequence_3;

layout (location = 0) out vec4 out_Color;

layout (set = 2, binding = 0) uniform sampler2D[] Sampler2DResources;

void main(void) {
    vec4 color = vec4(1.0);
    color *= texture(Sampler2DResources[Frequence_3.data[MATERIAL_ID].ColorMap], inData.TexCoord);
    color *= inData.VertexColor;
    out_Color = color;
}
(The code is generated by a program I am developing which is why the code might look a little strange, but it should make the problem clear)
When trying to do so, I am told
error: 'variable index' : required extension not requested: GL_EXT_nonuniform_qualifier
(for the third last line where the texture lookup also happens)
After following a lot of discussion about how "dynamically uniform" is specified, and how the shading language spec basically says the scope is defined by the API while neither OpenGL nor Vulkan really does so (maybe that has changed), I am confused about why I get that error.
Initially I wanted to use instanced vertex attributes for the indices, those however are not dynamically uniform which is what I thought the PushConstants would be.
So when PushConstants are constant during the draw call (which is the max scope for dynamically uniform requirement), how can the above shader end up in any dynamically non-uniform state?
Edit: Does it have to do with the fact that the buffer backing the storage for the "ColorMap" could be aliased by another buffer via which the content might be modified during the invocation? Or is there a way to tell the compiler this is a "restricted" storage so it knows it is constant?
Thanks
It is 3 a.m. over here, I should just go to sleep.
The chances that anyone ends up having the same problem are small, but I'd still rather answer it myself than delete it:
I simply had to add a specialization constant to set the size of the sampler2D array; now it works without requiring any extension.
Good night
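For reference, a minimal sketch of that fix (the constant_id value 2 is an arbitrary assumption): giving the sampler array an explicit size via a specialization constant means the compiler no longer sees an unsized, runtime-indexed array that would require GL_EXT_nonuniform_qualifier.

```glsl
// Sketch of the fix: size the sampler array with a specialization constant
// instead of leaving it unsized (constant_id 2 is an illustrative choice).
layout (constant_id = 2) const int SAMPLER_COUNT = 32;
layout (set = 2, binding = 0) uniform sampler2D Sampler2DResources[SAMPLER_COUNT];
```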

Can SSBO be read/write within the same shader?

I wrote a small tessellation program. I can write to a SSBO (checked output using RenderDoc) but reading the data back right away in the same shader (TCS) does not seem to work. If I set the tessellation levels directly, I can see that my code works:
In the main of the Tessellation Control shader:
gl_TessLevelInner[0] = 1;
gl_TessLevelOuter[0] = 1;
gl_TessLevelOuter[1] = 2;
gl_TessLevelOuter[2] = 4;
But going through the SSBO memory, it does not work. The display is blank like 0 were placed in the gl_TessLevelInner & gl_TessLevelOuter output.
Here is the SSBO in the TCS:
struct SSBO_Data {
    float Inside;    // Inside tessellation factor
    float Edges[3];  // Outside tessellation factors
};

layout(std430, binding = 2) volatile buffer Tiling {
    SSBO_Data Tiles[];
};
In the main of the Tessellation Control shader
Tiles[0].Inside = 1;
Tiles[0].Edges[0] = 1;
Tiles[0].Edges[1] = 2;
Tiles[0].Edges[2] = 4;
gl_TessLevelInner[0] = Tiles[0].Inside;
gl_TessLevelOuter[0] = Tiles[0].Edges[0];
gl_TessLevelOuter[1] = Tiles[0].Edges[1];
gl_TessLevelOuter[2] = Tiles[0].Edges[2];
In C++, I use the ShaderBuffer class from nVidia to create an array of a few thousand tiles and transfer data to the SSBO. I confirmed that the correct data is stored in the SSBO using RenderDoc.
In the ShaderBuffer class, I tried changing the glBufferData usage to GL_DYNAMIC_DRAW instead of GL_STATIC_DRAW but it did not help.
I also set the SSBO to volatile but that did not help.
I also inserted a barrier(); between the writing and reading of the SSBO data and it did not help either.
Is it possible to use SSBO for writing and reading back within the same shader?
Writing to any incoherent memory location (SSBO or image load/store) from a shader invocation and then reading from the same location works within the same stage if and only if:
1. You are reading it in the same invocation that did the writing.
2. Only that invocation wrote to the memory being read.
#2 holds even if all invocations write the same value. Violating #2 creates a race condition (again, regardless of the value written), which is undefined behavior.
I also inserted a barrier(); between the writing and reading of the SSBO data and it did not help either.
That's not going to do anything useful for your use case on its own; barrier() only synchronizes control flow between TCS invocations. What you actually need is the GLSL built-in memoryBarrierBuffer(), together with a coherent qualifier on the buffer.
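A minimal sketch of making the SSBO writes visible across TCS invocations, adapted from the snippets above (untested; only relevant if invocations other than the writer must read the values):

```glsl
// coherent makes buffer loads/stores visible across invocations;
// memoryBarrierBuffer() orders the stores relative to later reads, and
// barrier() synchronizes control flow within the patch.
layout(std430, binding = 2) coherent buffer Tiling {
    SSBO_Data Tiles[];
};

void main(void) {
    if (gl_InvocationID == 0) {
        Tiles[0].Inside = 1.0;
        Tiles[0].Edges[0] = 1.0;
    }
    memoryBarrierBuffer();  // make the stores visible to other invocations
    barrier();              // wait until every invocation reaches this point
    gl_TessLevelInner[0] = Tiles[0].Inside;
    gl_TessLevelOuter[0] = Tiles[0].Edges[0];
}
```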

Unexpected value when accessing an SSBO float

I am trying to calculate a morph offset for a gpu driven animation.
To that effect I have the following function (and SSBOS):
layout(std140, binding = 7) buffer morph_buffer
{
    vec4 morph_targets[];
};

layout(std140, binding = 8) buffer morph_weight_buffer
{
    float morph_weights[];
};

vec3 GetMorphOffset()
{
    vec3 offset = vec3(0);
    for (int target_index = 0; target_index < target_count; target_index++)
    {
        float w1 = morph_weights[1];
        offset += w1 * morph_targets[target_index * vertex_count + gl_VertexIndex].xyz;
    }
    return offset;
}
I am seeing strange behaviour so I opened renderdoc to trace the state:
As you can see, index 1 of the morph_weights SSBO is 0. However if I step over in the built in debugger for renderdoc I obtain:
Or in short, the variable I get back is 1, not 0.
So I did a little experiment and changed one of the values and now the SSBO looks like this:
And now I get this:
So my SSBO of type float seems to be treated like an SSBO of vec4s. I am aware of alignment issues with vec3s, but IIRC floats are fair game. What is happening?
After a little bit of asking around, I found the issue: the SSBO is marked std140, but the correct layout for a tightly packed float array is std430. Under std140, each element of a float array is padded to a 16-byte stride (a full vec4 slot), which is exactly the behavior observed above.
For the Vulkan GLSL dialect, an alternative is to use the scalar layout qualifier.
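Concretely, the fix is a one-word change to the weight buffer declaration; std430 drops the 16-byte array element stride that std140 imposes:

```glsl
// std140 pads each float array element to a 16-byte stride (one vec4 slot);
// std430 packs floats tightly at 4 bytes, matching what the C++ side writes.
layout(std430, binding = 8) buffer morph_weight_buffer
{
    float morph_weights[];
};
```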

Dynamic indexing into uniform array of sampler2D doesn't work

I need to index into an array of 2 uniform sampler2Ds. The index is dynamic per frame. That is, I have a dynamic uniform buffer which provides that index to a fragment shader. I use Vulkan API 1.2. In the device feature listing I have:
shaderSampledImageArrayDynamicIndexing = 1
I am not 100% sure, but it looks like this feature is core in 1.2. Nevertheless, I did try to enable it during device creation like this:
VkPhysicalDeviceFeatures features = {};
features.shaderSampledImageArrayDynamicIndexing = VK_TRUE;
Then plugging into device creation:
VkDeviceCreateInfo deviceCreateInfo = {};
deviceCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceCreateInfo.pQueueCreateInfos = queueCreateInfos;
deviceCreateInfo.queueCreateInfoCount = 1;
deviceCreateInfo.pEnabledFeatures = &features ;
deviceCreateInfo.enabledExtensionCount = NUM_DEVICE_EXTENSIONS;
deviceCreateInfo.ppEnabledExtensionNames = deviceExtensionNames;
In the shader it looks like this:
layout(std140, set = 0, binding = 1) uniform Material
{
    vec4 fparams0;
    vec4 fparams1;
    uvec4 iparams;   // .z - array texture idx
    uvec4 iparams1;
} material;

layout (set = 1, binding = 0) uniform sampler2D u_ColorMaps[2];
layout (location = 0) in vec2 texCoord;
layout (location = 0) out vec4 outColor;

void main()
{
    outColor = texture(u_ColorMaps[material.iparams.z], texCoord);
}
What I get is a combination of image pixels with some weird color. If I change to fixed indices, it works correctly. The material.iparams.z parameter has been verified; it provides the correct index every frame (0 or 1). No idea what else is missing. The validation layers say nothing.
My setup: Windows, RTX 3000, NVIDIA beta driver 443.41 (Vulkan 1.2)
Update:
I also found that the dynamically indexed sampler returns a value in the red channel (R)
which is close to one, with zeros in G and B. I don't set red anywhere, and the textures I fetch don't contain red either. Here are two screenshots; the upper one is the correct result, which I get when indexing with a constant value. The second is what happens when I index with a dynamic uint that comes from the dynamic UBO:
Correct
Wrong
The problem was due to the usage of Y′CBCR samplers. It turns out that Vulkan disallows dynamically indexing into an array of such uniforms.
Here is what the Vulkan spec says:
If the combined image sampler enables sampler Y′CBCR conversion or
samples a subsampled image, it must be indexed only by constant
integral expressions when aggregated into arrays in shader code,
irrespective of the shaderSampledImageArrayDynamicIndexing feature.
So, the solution for me was to provide two separately bound samplers and use the dynamic index in an if()...else condition to decide which sampler to use. Push constants would also work, but in that case I would have to re-record command buffers all the time. Hopefully this info will be helpful to other people working with video formats in the Vulkan API.
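A sketch of that workaround (binding numbers are illustrative): replace the sampler array with two separately bound samplers and branch on the dynamically uniform index.

```glsl
// Two separately bound Y'CbCr samplers selected with a branch instead of a
// dynamic array index (which the spec forbids for such samplers).
layout (set = 1, binding = 0) uniform sampler2D u_ColorMap0;
layout (set = 1, binding = 1) uniform sampler2D u_ColorMap1;

void main()
{
    if (material.iparams.z == 0u)
        outColor = texture(u_ColorMap0, texCoord);
    else
        outColor = texture(u_ColorMap1, texCoord);
}
```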

How to extend vertex shader capabilities for GPGPU

I'm trying to implement a Scrypt hasher (for an LTC miner) in GLSL (don't ask me why).
Actually, I'm stuck on the HMAC SHA-256 algorithm. Although I've implemented SHA-256 correctly (it returns the correct hash for an input), the fragment shader stops compiling when I add the last step (hashing the previous hash concatenated with the oKey).
The shader can't do more than three rounds of SHA-256; beyond that it simply fails to build. What are the limits? It doesn't use much memory, 174 vec2 objects in total, and each extra SHA-256 round requires no new memory, so the problem doesn't seem memory-related. It doesn't seem related to viewport size either: it stops working on both 1x1 and 1x128 viewports.
I started the miner on WebGL, but after the limit appeared, I tried running the same shader in Qt on full desktop OpenGL. As a result, desktop OpenGL allows one SHA-256 round fewer than OpenGL ES in WebGL (why?).
I forgot to mention: the shader fails at the linkage stage. The shader itself compiles fine, but program linking fails.
I don't use any textures, extensions, slow operations, etc. Just a simple quad (4 vec2 vertices) and several uniforms for the fragment shader.
The input data is just 80 bytes, and the result of the fragment shader is binary (black or white), so the task ideally fits GLSL principles.
My video card is a Radeon HD 7970 with plenty of VRAM, able to fit hundreds of Scrypt threads (Scrypt uses 128 kB per hash, but I can't even get HMAC-SHA-256 working). My card supports OpenGL 4.4.
I'm an OpenGL newbie and may be misunderstanding something. I understand that the fragment shader runs for each pixel separately, but if I have a 1x128 viewport, only 128x348 bytes are used. Where is the limit of the fragment shader?
Here is the common code I use to let you understand, how I'm trying to solve the problem.
uniform vec2 base_nonce[2];
uniform vec2 header[20]; /* Header of the block */
uniform vec2 H[8];
uniform vec2 K[64];

void sha256_round(inout vec2 w[64], inout vec2 t[8], inout vec2 hash[8]) {
    for (int i = 0; i < 64; i++) {
        if (i > 15) {
            w[i] = blend(w[i-16], w[i-15], w[i-7], w[i-2]);
        }
        _s0 = e0(t[0]);
        _maj = maj(t[0], t[1], t[2]);
        _t2 = safe_add(_s0, _maj);
        _s1 = e1(t[4]);
        _ch = ch(t[4], t[5], t[6]);
        _t1 = safe_add(safe_add(safe_add(safe_add(t[7], _s1), _ch), K[i]), w[i]);
        t[7] = t[6]; t[6] = t[5]; t[5] = t[4];
        t[4] = safe_add(t[3], _t1);
        t[3] = t[2]; t[2] = t[1]; t[1] = t[0];
        t[0] = safe_add(_t1, _t2);
    }
    for (int i = 0; i < 8; i++) {
        hash[i] = safe_add(t[i], hash[i]);
        t[i] = hash[i];
    }
}

void main () {
    vec2 key_hash[8]; /* Our SHA-256 hash */
    vec2 i_key[16];
    vec2 i_key_hash[8];
    vec2 o_key[16];
    vec2 w[64]; /* SHA-256 message schedule (was undeclared) */
    vec2 t[8];  /* SHA-256 working state (was undeclared) */
    vec2 nonced_header[20]; /* Header with nonce */
    set_nonce_to_header(nonced_header);
    vec2 P[32]; /* Padded SHA-256 message */
    pad_the_header(P, nonced_header);
    /* Hash HMAC secret key */
    sha256(P, key_hash);
    /* Make iKey and oKey */
    for (int i = 0; i < 16; i++) {
        if (i < 8) {
            i_key[i] = xor(key_hash[i], vec2(0x3636, 0x3636));
            o_key[i] = xor(key_hash[i], vec2(0x5c5c, 0x5c5c));
        } else {
            i_key[i] = vec2(0x3636, 0x3636);
            o_key[i] = vec2(0x5c5c, 0x5c5c);
        }
    }
    /* SHA-256 hash of iKey */
    for (int i = 0; i < 8; i++) {
        i_key_hash[i] = H[i];
        t[i] = i_key_hash[i];
    }
    for (int i = 0; i < 16; i++) { w[i] = i_key[i]; }
    sha256_round(w, t, i_key_hash);
    gl_FragColor = toRGBA(i_key_hash[0]);
}
What solutions can I use to improve the situation? Is there something cool in OpenGL 4.4, in OpenGL ES 3.1? Is it even possible to do such calculations and keep so much (128kB) in fragment shader? What are limits for the vertex shader? Can I do the same on the vertex shader instead the fragment?
I'll try to answer my own question.
A shader runs on a small processor with limited registers and cache memory, and there are limits on instruction count as well. So the whole architecture of fitting everything into one fragment shader is wrong.
On the other hand, you can switch shader programs tens or hundreds of times during a render; that is normal practice.
It is necessary to divide the big computation into smaller parts and render them separately, using render-to-texture to save the intermediate work.
According to WebGL statistics, 96.5% of clients have a MAX_TEXTURE_SIZE of 4096. That gives you 32 megabytes of memory, which can hold the draft data for 256 threads of Scrypt computation.
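As an illustration of that multipass idea (a sketch only; sha256_step, u_viewport and the ping-pong texture setup are assumed host-side machinery, not real API): each pass reads the previous pass's state from a texture, advances the computation by one slice, and writes the new state for the next pass.

```glsl
// One pass of a multipass GPGPU pipeline: read the state written by the
// previous pass from a texture, advance it by one round, write it out.
// The host renders into two textures in a ping-pong fashion.
uniform sampler2D u_prevState;  // output texture of the previous pass
uniform vec2 u_viewport;        // viewport size in pixels (assumed uniform)

void main() {
    vec4 state = texture2D(u_prevState, gl_FragCoord.xy / u_viewport);
    gl_FragColor = sha256_step(state);  // assumed helper doing one round
}
```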