Is it allowed to use a VkBool32 as a push constant? - c++

I am trying to create a VkBool32 in my C++ code:
VkBool32 myBool = VK_FALSE;
and push it to GLSL via a push constant:
vkCmdPushConstants(..., sizeof(myBool), &myBool);
which is received by a bool inside a uniform storage class:
layout(push_constant) uniform PushConstants
{
bool myBool;
} pushConts;
First tests seem to work and have the intended behaviour. But is this permitted by the Vulkan Spec?

Using bools for push constants is fine. There is nothing in the spec that prohibits this, and I've been using it in a few examples too.
If you take a look at the human-readable SPIR-V output you'll see that they're converted to 32 bit integers and thus are aligned to 32 bit:
GLSL
layout (push_constant) uniform PushConsts {
bool calculateNormals;
} pushConsts;
SPIR-V
430(PushConsts): TypeStruct 40(int)
431: TypePointer PushConstant 430(PushConsts)
432(pushConsts): 431(ptr) Variable PushConstant
433: TypePointer PushConstant 40(int)
So if you pass, for example, a struct containing multiple booleans, you have to properly align (pad) it on the CPU side before passing it as a push constant.
As for the SPIR-V side of things, the official spec is always a good starting point and also contains details on how push constants are handled and how they differ.
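To make the padding point concrete, here is a minimal C++ sketch. It assumes a hypothetical push constant block with two bools; the names PushConsts, PackedBools, useFog and toPushConsts are made up for illustration, and Bool32 is a stand-in for VkBool32 (a 32-bit unsigned integer) so the snippet builds without the Vulkan headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for VkBool32 so the sketch builds without <vulkan/vulkan.h>.
using Bool32 = std::uint32_t;

// Hypothetical CPU-side mirror of a block such as:
//   layout(push_constant) uniform PushConsts { bool calculateNormals; bool useFog; } pushConsts;
// Each GLSL bool becomes a 32-bit integer, so each member here is 32 bits wide.
struct PushConsts {
    Bool32 calculateNormals;
    Bool32 useFog;
};

static_assert(sizeof(PushConsts) == 8, "two 32-bit slots expected");
static_assert(offsetof(PushConsts, useFog) == 4, "second bool starts at byte 4");

// A struct of plain C++ bools is typically packed one byte per member,
// which would not match the layout the shader reads.
struct PackedBools {
    bool calculateNormals;
    bool useFog;
};

// Converts the tightly packed form into the padded form that would be pushed.
inline PushConsts toPushConsts(const PackedBools& in) {
    return PushConsts{in.calculateNormals ? 1u : 0u, in.useFog ? 1u : 0u};
}
```

The padded struct is what you would hand to vkCmdPushConstants, with sizeof(PushConsts) as the size.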

Related

How to Use Vulkan SPIR-V Data Formats in GLSL

SPIR-V allows for very verbose data formats.
GLSL has only basic data types (Chapter 4) that do not specify bit length.
As far as I am aware the most convenient way to program shaders for Vulkan is to program them in GLSL, then use the Vulkan SDK provided compiler (glslc.exe) to convert the file into a SPIR-V binary.
My question is: how does one use these verbose data formats, such as VK_FORMAT_R4G4_UNORM_PACK8 (found in the SPIR-V link above), in GLSL while using glslc.exe to compile the shader code? Are there special data types that the compiler allows for? If not, is there an alternative higher-level language that one could use and then compile into the binary?
For example if this was the attribute descriptions used in the graphics pipeline:
struct Attributes {
vec2 pos;
char flags;
};
// Two attributes are described, so the array holds two entries.
static inline std::array<VkVertexInputAttributeDescription, 2> getAttributeDescriptions() {
std::array<VkVertexInputAttributeDescription, 2> attributeDescriptions{};
// Attribute 0: 2D position as two 32-bit floats.
attributeDescriptions[0].binding = 0;
attributeDescriptions[0].location = 0;
attributeDescriptions[0].format = VK_FORMAT_R32G32_SFLOAT;
attributeDescriptions[0].offset = offsetof(Attributes, pos);
// Attribute 1: two 4-bit UNORM values packed into one byte.
attributeDescriptions[1].binding = 0;
attributeDescriptions[1].location = 1;
attributeDescriptions[1].format = VK_FORMAT_R4G4_UNORM_PACK8;
attributeDescriptions[1].offset = offsetof(Attributes, flags);
return attributeDescriptions;
}
The proceeding GLSL shader code would look something like this:
#version 450
#extension GL_ARB_separate_shader_objects : enable
//Instance Attributes
layout(location = 0) in vec2 pos;
layout(location = 1) in 4BitVec2DataType flags;
//4BitVec2DataType is a placeholder for whatever GLSL's equivalent of SPIR-V's VK_FORMAT_R4G4_UNORM_PACK8 would be
void main() {
...
}
The proceeding GLSL shader code would look something like this:
No, it wouldn't. You would receive a vec2 in the shader, because that's how vertex attributes work. The vertex format is not meant to exactly match the data format; the data will be converted from that format to the shader-expected bitdepth. Unsigned normalized values are floating-point data, so a 2-vector UNORM maps to a GLSL vec2.
And BTW, SPIR-V does not change this. The shader's input size need not exactly match the given data size; any conversion is just baked into the shader (this is also part of why the vertex format is part of the pipeline).
The GL_EXT_shader_16bit_storage and GL_EXT_shader_8bit_storage extensions offer more flexibility in GLSL for creating unusual sizes of data types within buffer-backed interface blocks. But these are specifically for data in UBOs/SSBOs, not vertex formats. They also require the SPV_KHR_16bit_storage and SPV_KHR_8bit_storage SPIR-V extensions, respectively.
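As a rough illustration of the conversion described above, this C++ sketch decodes one VK_FORMAT_R4G4_UNORM_PACK8 byte into the two floats a vec2 input would receive. The function name unpackR4G4Unorm is made up, and the sketch assumes the layout from the Vulkan format description (R in bits 4..7, G in bits 0..3) and the usual UNORM rule of dividing by 2^bits - 1. The real conversion is done by the implementation, not by your code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Decodes one packed byte into {r, g}.
// Assumed layout: R in bits 4..7, G in bits 0..3.
// UNORM maps the integer range [0, 15] linearly onto [0.0, 1.0].
inline std::array<float, 2> unpackR4G4Unorm(std::uint8_t packed) {
    const float r = static_cast<float>((packed >> 4) & 0x0F) / 15.0f;
    const float g = static_cast<float>(packed & 0x0F) / 15.0f;
    return {r, g};
}
```

So a byte of 0xF0 in the vertex buffer would show up in the shader as roughly vec2(1.0, 0.0), which is why the shader-side type is vec2 and not some 4-bit type.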

GLSL: about coherent qualifier

I don't clearly understand how the coherent qualifier and atomic operations work together.
I perform some accumulating operation on the same SSBO location with this code:
uint prevValue, newValue;
uint readValue = ssbo[index];
do
{
prevValue = readValue;
newValue = F(readValue);
}
while((readValue = atomicCompSwap(ssbo[index], prevValue, newValue)) != prevValue);
This code works fine for me, but do I still need to declare the SSBO (or image) with the coherent qualifier in this case?
And do I need to use coherent when I only call atomicAdd?
When exactly do I need the coherent qualifier? Is it needed only for direct writes such as ssbo[index] = value;?
TL;DR
I found evidence that supports both answers regarding coherent.
Current score:
Requiring coherent with atomics: 1.5
Omitting coherent with atomics: 5.75
Bottom line, still not sure despite the score. Inside a single workgroup, I'm mostly convinced coherent is not required in practice. I'm not so sure in these cases:
more than 1 workgroup in glDispatchCompute
multiple glDispatchCompute calls that all access the same memory location (atomically) without any glMemoryBarrier between them
However, is there a performance cost to declaring SSBOs (or individual struct members) coherent when you only access them through atomic operations? Based on what is below, I don't believe there is because coherent adds "visibility" instructions or instruction flags at the variable's read or write operations. If a variable is only accessed through atomic operations, the compiler should hopefully:
ignore coherent when generating the atomic instructions because it has no effect
use the appropriate mechanic to make sure the result of the atomic operation is visible outside the shader invocation, warp, workgroup or rendering command.
From the OpenGL wiki's "Memory Model" page:
Note that atomic counters are different functionally from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like. (removed on 2020-04-12)
However, if memory has been modified in an incoherent fashion, any subsequent reads from that memory are not automatically guaranteed to see these changes.
+1 for requiring coherent
The code from Intel's article "OpenGL Performance Tips: Atomic Counter Buffers versus Shader Storage Buffer Objects"
// Fragment shader used for ACB gets output color from a texture
#version 430 core
uniform sampler2D texUnit;
layout(binding = 0) uniform atomic_uint acb[ s(nCounters) ];
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
void main()
{
for (int i=0; i< s(nCounters) ; ++i) atomicCounterIncrement(acb[i]);
fragColor = texture(texUnit, texcoord);
}
// Fragment shader used for SSBO gets output color from a texture
#version 430 core
uniform sampler2D texUnit;
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
layout(std430, binding = 0) buffer ssbo_data
{
uint v[ s(nCounters) ];
};
void main()
{
for (int i=0; i< s(nCounters) ; ++i) atomicAdd(v[i], 1);
fragColor = texture(texUnit, texcoord);
}
Notice that ssbo_data in the second shader is not declared coherent.
The article also states:
The OpenGL foundation recommends using [atomic counter buffers] over SSBOs for various reasons; however improved performance is not one of them. This is because ACBs are internally implemented as SSBO atomic operations; therefore there are no real performance benefits from utilizing ACBs.
So atomic counters are actually the same thing as SSBOs apparently. (But what are those "various reasons" and where are those recommendations? Is Intel hinting at a conspiracy in favor of atomic counters...?)
+1 for omitting coherent
GLSL specification
The GLSL spec uses different wording when describing coherent and atomic operations (emphasis mine):
(4.10) When accessing memory using variables not declared as coherent, the memory accessed by a shader may be cached by the implementation to service future accesses to the same address. Memory stores may be cached in such a way that the values written might not be visible to other shader invocations accessing the same memory. The implementation may cache the values fetched by memory reads and return the same values to any shader invocation accessing the same memory, even if the underlying memory has been modified since the first memory read.
(8.11) Atomic memory functions perform atomic operations on an individual signed or unsigned integer stored in buffer-object or shared-variable storage. All of the atomic memory operations read a value from memory, compute a new value using one of the operations described below, write the new value to memory, and return the original value read. The contents of the memory being updated by the atomic operation are guaranteed not to be modified by any other assignment or atomic memory function in any shader invocation between the time the original value is read and the time the new value is written.
All the built-in functions in this section accept arguments with combinations of restrict, coherent, and volatile memory qualification, despite not having them listed in the prototypes. The atomic operation will operate as required by the calling argument’s memory qualification, not by the built-in function’s formal parameter memory qualification.
So on the one hand atomic operations are supposed to work directly with the storage's memory (does that imply bypassing possible caches?). On the other hand, it seems that memory qualifications (e.g. coherent) play a role in what the atomic operation does.
+0.5 for requiring coherent
OpenGL specification
The OpenGL 4.6 spec sheds more light on this issue in section 7.13.1 "Shader Memory Access Ordering"
The built-in atomic memory transaction and atomic counter functions may be used to read and write a given memory address atomically. While built-in atomic functions issued by multiple shader invocations are executed in undefined order relative to each other, these functions perform both a read and a write of a memory address and guarantee that no other memory transaction will write to the underlying memory between the read and write. Atomics allow shaders to use shared global addresses for mutual exclusion or as counters, among other uses.
The intent of atomic operations then clearly seems to be, well, atomic all the time and not depending on a coherent qualifier. Indeed, why would one want an atomic operation that isn't somehow combined between different shader invocations? Incrementing a locally cached value from multiple invocations and having all of them eventually write a completely independent value makes no sense.
+1 for omitting coherent
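For readers more at home on the CPU side, the read-modify-write guarantee quoted above is the same idea as a compare-and-swap loop on std::atomic. The sketch below is only an analogy for the atomicCompSwap loop in the question: atomicUpdate is a made-up helper, and the C++ memory model is not the GLSL one, so it says nothing about whether coherent is needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// CPU-side analogy of the atomicCompSwap loop.
// 'f' computes the new value from the value currently stored in 'slot'.
// Returns the value that was replaced.
template <typename F>
std::uint32_t atomicUpdate(std::atomic<std::uint32_t>& slot, F f) {
    std::uint32_t expected = slot.load();
    // On failure compare_exchange_weak writes the current value into
    // 'expected', so the next iteration recomputes f on fresh data.
    while (!slot.compare_exchange_weak(expected, f(expected))) {
    }
    return expected;
}
```

The point of the analogy: the swap only succeeds if nobody else wrote in between, which is exactly the guarantee the spec text gives for the GLSL atomic functions.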
OpenGL spec issue #14
OpenGL 4.6: Do atomic counter buffers require the use of glMemoryBarrier calls to be able to access the counter?
We discussed this again in the OpenGL|ES meeting. Based on feedback from IHVs and their implementation of atomic counters we're planning to treat them like we treat other resources like image atomic, image load/store, buffer variables, etc. in that they require explicit synchronization from the application. The spec will be changed to add "atomic counters" to the places where the other resources are enumerated.
The described spec change occurred between OpenGL 4.5 and 4.6, but it relates to glMemoryBarrier, which plays no part inside a single glDispatchCompute.
no effect
Example Shader
Let's inspect the assembly produced by two simple shaders to see what happens in practice.
#version 460
layout(local_size_x = 512) in;
// Non-coherent qualified SSBO
layout(binding=0) restrict buffer Buf { uint count; } buf;
// Coherent qualified SSBO
layout(binding=1) coherent restrict buffer Buf_coherent { uint count; } buf_coherent;
void main()
{
// First shader with atomics (v1)
uint read_value1 = atomicAdd(buf.count, 2);
uint read_value2 = atomicAdd(buf_coherent.count, 4);
// Second shader with non-atomic add (v2)
buf.count += 2;
buf_coherent.count += 4;
}
The second shader is used to compare the effects of the coherent qualifier between atomic operations and non-atomic operations.
AMD
AMD publishes Instruction Set Architecture (ISA) Documents which coupled with the Radeon GPU Analyzer gives insight into how GPUs actually implement this.
Shader v1 (Vega gfx900)
s_getpc_b64 s[0:1] BE801C80
s_mov_b32 s0, s2 BE800002
s_mov_b64 s[2:3], exec BE82017E
s_ff1_i32_b64 s4, exec BE84117E
s_lshl_b64 s[4:5], 1, s4 8E840481
s_and_b64 s[4:5], s[4:5], exec 86847E04
s_and_saveexec_b64 s[4:5], s[4:5] BE842004
s_cbranch_execz label_0010 BF880008
s_load_dwordx4 s[8:11], s[0:1], 0x00 C00A0200 00000000
s_bcnt1_i32_b64 s2, s[2:3] BE820D02
s_mulk_i32 s2, 0x0002 B7820002
v_mov_b32 v0, s2 7E000202
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
label_0010:
s_mov_b64 exec, s[4:5] BEFE0104
s_mov_b64 s[2:3], exec BE82017E
s_ff1_i32_b64 s4, exec BE84117E
s_lshl_b64 s[4:5], 1, s4 8E840481
s_and_b64 s[4:5], s[4:5], exec 86847E04
s_and_saveexec_b64 s[4:5], s[4:5] BE842004
s_cbranch_execz label_001F BF880008
s_load_dwordx4 s[8:11], s[0:1], 0x20 C00A0200 00000020
s_bcnt1_i32_b64 s0, s[2:3] BE800D02
s_mulk_i32 s0, 0x0004 B7800004
v_mov_b32 v0, s0 7E000200
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
label_001F:
s_endpgm BF810000
(Don't know why the exec mask and branching is used here...)
We can see that both atomic operations (on coherent and non-coherent buffers) result in the same instruction on all supported architectures of the Radeon GPU Analyzer:
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
Decoding this instruction shows that the GLC (Globally Coherent) flag is set to 0 which means for atomic operations: "Previous data value is not returned. No L1 persistence across wavefronts". Modifying the shader to use the returned values changes the GLC flag of both atomic instructions to 1 which means: "Previous data value is returned. No L1 persistence across wavefronts".
The documents dating from 2013 (Sea Islands, etc.) have an interesting description of the BUFFER_ATOMIC_<op> instructions:
Buffer object atomic operation. Always globally coherent.
So on AMD hardware, it appears coherent has no effect for atomic operations.
Shader v2 (Vega gfx900)
s_getpc_b64 s[0:1] BE801C80
s_mov_b32 s0, s2 BE800002
s_load_dwordx4 s[4:7], s[0:1], 0x00 C00A0100 00000000
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_load_dword v0, v0, s[4:7], 0 E0500000 80010000
s_load_dwordx4 s[0:3], s[0:1], 0x20 C00A0000 00000020
s_waitcnt vmcnt(0) BF8C0F70
v_add_u32 v0, 2, v0 68000082
buffer_store_dword v0, v0, s[4:7], 0 glc E0704000 80010000
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_load_dword v0, v0, s[0:3], 0 glc E0504000 80000000
s_waitcnt vmcnt(0) BF8C0F70
v_add_u32 v0, 4, v0 68000084
buffer_store_dword v0, v0, s[0:3], 0 glc E0704000 80000000
s_endpgm BF810000
The buffer_load_dword operation on the coherent buffer uses the glc flag and the other one does not as expected.
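If you want to check the flag from the hex encodings instead of the disassembly text, a small helper like the one below works on the first dword of each instruction. It assumes GLC is bit 14 of the MUBUF encoding, which matches the listings above (the words printed with glc differ from the others by 0x4000); hasGlc and kGlcMask are made-up names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: in the first dword of a MUBUF instruction, GLC is bit 14.
constexpr std::uint32_t kGlcMask = 1u << 14;  // 0x4000

// Returns true if the GLC bit is set in the given instruction dword.
constexpr bool hasGlc(std::uint32_t firstDword) {
    return (firstDword & kGlcMask) != 0;
}
```

Feeding it the words from the listings: 0xE0500000 (load on buf) has no GLC, 0xE0504000 (load on buf_coherent) has it, and 0xE1080000 (buffer_atomic_add in shader v1) does not.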
On AMD: +1 for omitting coherent
NVIDIA
It's possible to get the assembly of a shader by inspecting the blob returned by glGetProgramBinary(). The instructions are described in NV_gpu_program4, NV_gpu_program5 and NV_gpu_program5_mem_extended.
Shader v1
!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
ATOMB.ADD.U32 R0.x, {2, 0, 0, 0}, sbo_buf0[0];
ATOMB.ADD.U32 R0.x, {4, 0, 0, 0}, sbo_buf1[0];
END
There is no difference whether coherent is present or not.
Shader v2
!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
LDB.U32 R0.x, sbo_buf0[0];
ADD.U R0.x, R0, {2, 0, 0, 0};
STB.U32 R0, sbo_buf0[0];
LDB.U32.COH R0.x, sbo_buf1[0];
ADD.U R0.x, R0, {4, 0, 0, 0};
STB.U32 R0, sbo_buf1[0];
END
The LDB.U32 operation on the coherent buffer uses the COH modifier which means "Make LOAD and STORE operations use coherent caching".
On NVIDIA: +1 for omitting coherent
SPIR-V (with Vulkan target)
Let's see what SPIR-V code is generated by the glslang SPIR-V generator.
Shader v1
// Generated with glslangValidator.exe -H --target-env vulkan1.1
// Module Version 10300
// Generated by (magic number): 80008
// Id's are bound by 30
Capability Shader
1: ExtInstImport "GLSL.std.450"
MemoryModel Logical GLSL450
EntryPoint GLCompute 4 "main"
ExecutionMode 4 LocalSize 512 1 1
Source GLSL 460
Name 4 "main"
Name 8 "read_value1"
Name 9 "Buf"
MemberName 9(Buf) 0 "count"
Name 11 "buf"
Name 20 "read_value2"
Name 21 "Buf_coherent"
MemberName 21(Buf_coherent) 0 "count"
Name 23 "buf_coherent"
MemberDecorate 9(Buf) 0 Restrict
MemberDecorate 9(Buf) 0 Offset 0
Decorate 9(Buf) Block
Decorate 11(buf) DescriptorSet 0
Decorate 11(buf) Binding 0
MemberDecorate 21(Buf_coherent) 0 Coherent
MemberDecorate 21(Buf_coherent) 0 Restrict
MemberDecorate 21(Buf_coherent) 0 Offset 0
Decorate 21(Buf_coherent) Block
Decorate 23(buf_coherent) DescriptorSet 0
Decorate 23(buf_coherent) Binding 1
Decorate 29 BuiltIn WorkgroupSize
2: TypeVoid
3: TypeFunction 2
6: TypeInt 32 0
7: TypePointer Function 6(int)
9(Buf): TypeStruct 6(int)
10: TypePointer StorageBuffer 9(Buf)
11(buf): 10(ptr) Variable StorageBuffer
12: TypeInt 32 1
13: 12(int) Constant 0
14: TypePointer StorageBuffer 6(int)
16: 6(int) Constant 2
17: 6(int) Constant 1
18: 6(int) Constant 0
21(Buf_coherent): TypeStruct 6(int)
22: TypePointer StorageBuffer 21(Buf_coherent)
23(buf_coherent): 22(ptr) Variable StorageBuffer
25: 6(int) Constant 4
27: TypeVector 6(int) 3
28: 6(int) Constant 512
29: 27(ivec3) ConstantComposite 28 17 17
4(main): 2 Function None 3
5: Label
8(read_value1): 7(ptr) Variable Function
20(read_value2): 7(ptr) Variable Function
15: 14(ptr) AccessChain 11(buf) 13
19: 6(int) AtomicIAdd 15 17 18 16
Store 8(read_value1) 19
24: 14(ptr) AccessChain 23(buf_coherent) 13
26: 6(int) AtomicIAdd 24 17 18 25
Store 20(read_value2) 26
Return
FunctionEnd
The only difference between buf and buf_coherent is the decoration of the latter with MemberDecorate 21(Buf_coherent) 0 Coherent. Their usage afterwards is identical.
Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:
Capability Shader
+ Capability VulkanMemoryModelKHR
+ Extension "SPV_KHR_vulkan_memory_model"
1: ExtInstImport "GLSL.std.450"
- MemoryModel Logical GLSL450
+ MemoryModel Logical VulkanKHR
EntryPoint GLCompute 4 "main"
Decorate 11(buf) Binding 0
- MemberDecorate 21(Buf_coherent) 0 Coherent
MemberDecorate 21(Buf_coherent) 0 Restrict
which means... I don't quite know because I'm not versed in Vulkan's intricacies. I did find this informative section of the "Memory Model" appendix in the Vulkan 1.2 spec:
While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations. GLSL to SPIR-V compilers should map all (non-atomic) operations on a coherent variable to Make{Pointer,Texel}{Available}{Visible} flags in this model.
Atomic operations implicitly have availability/visibility operations, and the scope of those operations is taken from the atomic operation’s scope.
Shader v2
(skipping full output)
The only difference between buf and buf_coherent is again MemberDecorate 18(Buf_coherent) 0 Coherent.
Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:
- MemberDecorate 18(Buf_coherent) 0 Coherent
- 23: 6(int) Load 22
- 24: 6(int) IAdd 23 21
- 25: 13(ptr) AccessChain 20(buf_coherent) 11
- Store 25 24
+ 23: 6(int) Load 22 MakePointerVisibleKHR NonPrivatePointerKHR 24
+ 25: 6(int) IAdd 23 21
+ 26: 13(ptr) AccessChain 20(buf_coherent) 11
+ Store 26 25 MakePointerAvailableKHR NonPrivatePointerKHR 24
Notice the addition of MakePointerVisibleKHR and MakePointerAvailableKHR that control operation coherency at the instruction level instead of the variable level.
+1 for omitting coherent (maybe?)
CUDA
The Parallel Thread Execution ISA section of the CUDA Toolkit documentation has this information:
8.5. Scope
Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are three scopes:
Table 18. Scopes
.cta: The set of all threads executing in the same CTA as the current thread.
.gpu: The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.
.sys The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.
Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.
Regarding CTA:
A cooperative thread array (CTA) is a set of concurrent threads that execute the same kernel program. A grid is a set of CTAs that execute independently.
So in GLSL terms, CTA == work group and grid == glDispatchCompute call.
The atom instruction description:
9.7.12.4. Parallel Synchronization and Communication Instructions: atom
Atomic reduction operations for thread-to-thread communication.
[...]
The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model.
[...]
If no scope is specified, the atomic operation is performed with .gpu scope.
So by default, all shader invocations of a glDispatchCompute would see the result of an atomic operation... unless the GLSL compiler generates something that uses the cta scope in which case it would only be visible inside the workgroup. This latter case however corresponds to shared GLSL variables so perhaps it's only used for those and not for SSBO operations. NVIDIA isn't very open about this process so I haven't found a way to tell for sure (perhaps with glGetProgramBinary). However, since the semantics of cta map to a work group and gpu to buffers (i.e. SSBO, images, etc), I declare:
+0.5 for omitting coherent
Empirical evidence
I have written a particle system compute shader that uses an SSBO-backed variable as an operand to atomicAdd() and it works. Using coherent was not necessary even with a work group size of 512. However, there was never more than 1 work group. This was tested mainly on an NVIDIA GTX 1080 so, as seen above, atomic operations on NVIDIA seem to always be at least visible inside the work group.
+0.25 for omitting coherent

Having a non bound sampler inside an uniform branch

Let's say I have a pixel shader that sometimes needs to read from one sampler and sometimes from two different samplers, depending on a uniform variable:
layout (set = 0, binding = 0) uniform UBO {
....
bool useSecondTexture;
} ubo;
...
void main() {
vec3 value0 = texture(sampler1, pos).rgb;
vec3 value2 = vec3(0,0,0);
if(ubo.useSecondTexture) {
value2 = texture(sampler2, pos).rgb;
}
value0 += value2;
}
Does the second sampler, sampler2, need to be bound to a valid texture even though the texture will not be read when useSecondTexture is false?
All of the vkCmdDraw and vkCmdDispatch commands have this Valid Usage statement:
Descriptors in each bound descriptor set, specified via vkCmdBindDescriptorSets, must be valid if they are statically used by the currently bound VkPipeline object, specified via vkCmdBindPipeline
Since sampler2 is statically used, you must have a valid descriptor for it or you'll get undefined behavior.
My guess is that on some implementations, it'll work as you expect. But drivers/hardware are allowed to require that all descriptors that might be used by a pipeline are valid, and requiring them to inspect the contents of memory buffers to determine if something might be used would be very expensive.

GL_SHADER_STORAGE_BUFFER memory limitations

I'm writing a ray tracer with OpenGL compute shaders, and I use buffers to pass data to and from the shaders.
When the size of the vec2 output buffer (which equals the number of rays multiplied by the number of faces) reaches ~30 MB, any attempt to map the buffer consistently returns a NULL pointer. Range mapping also fails.
I can't find any information about GL_SHADER_STORAGE_BUFFER limitations in the OpenGL documentation. Is ~30 MB a limit, or could this mapping failure be caused by something else?
And is there any way to avoid this other than invoking the shader multiple times?
Data declaration in shader:
#version 440
layout(std430, binding=0) buffer rays{
vec4 r[];
};
layout(std430, binding=1) buffer faces{
vec4 f[];
};
layout(std430, binding=2) buffer outputs{
vec2 o[];
};
uniform int face_count;
uniform vec4 origin;
Calling code (using some Qt5 wrappers):
QOpenGLBuffer ray_buffer;
QOpenGLBuffer face_buffer;
QOpenGLBuffer output_buffer;
QVector<QVector2D> output;
output.resize(rays[r].size()*faces.size());
if(!ray_buffer.create()) { /*...*/ }
if(!ray_buffer.bind()) { /*...*/ }
ray_buffer.allocate(rays.data(), rays.size()*sizeof(QVector4D));
if(!face_buffer.create()) { /*...*/ }
if(!face_buffer.bind()) { /*...*/ }
face_buffer.allocate(faces.data(), faces.size()*sizeof(QVector4D));
if(!output_buffer.create()) { /*...*/ }
if(!output_buffer.bind()) { /*...*/ }
output_buffer.allocate(output.size()*sizeof(QVector2D));
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ray_buffer.bufferId());
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, face_buffer.bufferId());
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, output_buffer.bufferId());
int face_count = faces.size();
compute.setUniformValue("face_count", face_count);
compute.setUniformValue("origin", pos);
ogl->glDispatchCompute(rays.size()/256, faces.size(), 1);
ray_buffer.destroy();
face_buffer.destroy();
QVector2D* data = (QVector2D*)output_buffer.map(QOpenGLBuffer::ReadOnly);
First of all, you have to understand that the OpenGL specification defines minimum maxima for a variety of values (the ones starting with a MAX_ prefix). That means that implementations are required to provide at least the specified amount as the maximum value, but are free to increase the limit as implementors see fit. This way, developers can at least rely on some upper bound, but can still make provisions for possibly larger values.
Section 23 - State Tables summarizes what has been previously specified in the corresponding sections. The information you were looking for is found in table 23.64 - Implementation Dependent Aggregate Shader Limits (cont.). If you want to know about which state belongs where (because there is per-object state, quasi-global state, program state and so on), you go to section 23.
The minimum maximum size of a shader storage buffer is represented by the symbolic constant MAX_SHADER_STORAGE_BLOCK_SIZE as per section 7.8 of the core OpenGL 4.5 specification.
Since their adoption into core, the required size (i.e. the minimum maximum) has been significantly increased. In core OpenGL 4.3 and 4.4, the minimum maximum was pow(2, 24) (or 16MB with 1 byte basic machine units and 1MB = 1024^2 bytes) - in core OpenGL 4.5 this value is now pow(2, 27) (or 128MB)
Summary: When in doubt about OpenGL state, refer to section 23 of the core specification.
From OpenGL Wiki:
SSBOs can be much larger. The OpenGL spec guarantees that UBOs can be
up to 16KB in size (implementations can allow them to be bigger). The
spec guarantees that SSBOs can be up to 128MB. Most implementations
will let you allocate a size up to the limit of GPU memory.
OpenGL < 4.5 guarantees only 16 MiB (OpenGL 4.5 increased the minimum to 128 MiB). You can use glGet() to query whether your implementation lets you bind more:
GLint64 max;
glGetInteger64v(GL_MAX_SHADER_STORAGE_BLOCK_SIZE, &max);
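Once you have the limit, checking the output buffer against it is simple arithmetic. The sketch below assumes one vec2 (8 bytes in std430, with 4-byte floats) per ray/face pair, as in the question; the helper names and the constants for the two guaranteed minimums are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Guaranteed minimums quoted above (2^24 and 2^27 bytes).
constexpr std::uint64_t kMinMaxSsboBytesGl43 = 1ull << 24;  // 16 MiB
constexpr std::uint64_t kMinMaxSsboBytesGl45 = 1ull << 27;  // 128 MiB

// Bytes needed for a std430 array of vec2 with one element per (ray, face) pair.
constexpr std::uint64_t outputBufferBytes(std::uint64_t rayCount, std::uint64_t faceCount) {
    return rayCount * faceCount * 2u * sizeof(float);
}

// 'maxBlockSize' is the value returned for GL_MAX_SHADER_STORAGE_BLOCK_SIZE.
constexpr bool fitsInBlock(std::uint64_t bytes, std::uint64_t maxBlockSize) {
    return bytes <= maxBlockSize;
}
```

For example, 1024 rays against 4096 faces need 32 MiB, which is above the 16 MiB guaranteed before OpenGL 4.5 but below the 128 MiB guaranteed by 4.5.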
In fact, the problem seems to be in the Qt wrappers. I didn't look in depth, but when I changed QOpenGLBuffer's create(), bind(), allocate() and map() to glCreateBuffers(), glBindBuffer(), glNamedBufferData() and glMapNamedBuffer(), all called through QOpenGLFunctions_4_5_Core, the memory problem was gone until I reached 2 GB (which is the GPU's physical memory limit).
The second error I made was not using glMemoryBarrier(), but adding it didn't help while QOpenGLBuffer was in use.

Issue with glBindBufferRange() OpenGL 3.1

My vertex shader is:
uniform Block1{ vec4 offset_x1; vec4 offset_x2;}block1;
out float value;
in vec4 position;
void main()
{
value = block1.offset_x1.x + block1.offset_x2.x;
gl_Position = position;
}
The code I am using to pass values is:
GLfloat color_values[8];// contains valid values
glGenBuffers(1,&buffer_object);
glBindBuffer(GL_UNIFORM_BUFFER,buffer_object);
glBufferData(GL_UNIFORM_BUFFER,sizeof(color_values),color_values,GL_STATIC_DRAW);
glUniformBlockBinding(psId,blockIndex,0);
glBindBufferRange(GL_UNIFORM_BUFFER,0,buffer_object,0,16);
glBindBufferRange(GL_UNIFORM_BUFFER,0,buffer_object,16,16);
What I expect here is to pass 16 bytes for each vec4 uniform. Instead, I get a GL_INVALID_VALUE error for offset = 16, size = 16.
I am confused about the offset value. The spec says it corresponds to "buffer_object".
There is an alignment restriction for UBOs when binding. Any glBindBufferRange/Base's offset must be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. This alignment could be anything, so you have to query it before building your array of uniform buffers. That means you can't do it directly in compile-time C++ logic; it has to be runtime logic.
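The runtime logic usually boils down to rounding each offset up to the queried alignment. A minimal sketch, with alignUp being a made-up helper name and the alignment value assumed to come from glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...):

```cpp
#include <cassert>
#include <cstdint>

// Rounds 'offset' up to the next multiple of 'alignment'.
// 'alignment' is the value queried for GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT
// and must be greater than zero.
inline std::uint32_t alignUp(std::uint32_t offset, std::uint32_t alignment) {
    assert(alignment > 0);
    return (offset + alignment - 1) / alignment * alignment;
}
```

With an alignment of 256, for instance, an intended offset of 16 has to move to 256, which is why offset = 16 can be rejected with GL_INVALID_VALUE.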
Speaking of querying things at runtime, your code is horribly broken in many other ways. You did not define a layout qualifier for your uniform block; therefore, the default is used: shared. And you cannot use shared layout without querying the layout of each block's members from OpenGL. Ever.
If you had done a query, you would have quickly discovered that your uniform block is at least 32 bytes in size, not 16. And since you only provided 16 bytes in your range, undefined behavior (which includes the possibility of program termination) results.
If you want to be able to define C/C++ objects that map exactly to the uniform block definition, you need to use std140 layout and follow the rules of std140's layout in your C/C++ object.
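For this particular block the std140 mirror is easy, because a vec4 is 16 bytes with 16-byte alignment. The sketch below assumes the block is declared with layout(std140) in the shader; Block1Std140 is a made-up name and the static_asserts just document the expected layout.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of:
//   layout(std140) uniform Block1 { vec4 offset_x1; vec4 offset_x2; } block1;
// Each vec4 occupies 16 bytes, so the two members sit back to back.
struct Block1Std140 {
    float offset_x1[4];
    float offset_x2[4];
};

static_assert(sizeof(Block1Std140) == 32, "whole block is 32 bytes");
static_assert(offsetof(Block1Std140, offset_x2) == 16, "second vec4 starts at byte 16");
```

The whole block is then bound as a single range of sizeof(Block1Std140) bytes at an aligned offset, instead of one 16-byte range per member.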