GLSL: about coherent qualifier - opengl

I didn't get clearly how coherent qualifier and atomic operations work together.
I perform some accumulating operation on the same SSBO location with this code:
uint prevValue, newValue;
uint readValue = ssbo[index];
do
{
prevValue = readValue;
newValue = F(readValue);
}
while((readValue = atomicCompSwap(ssbo[index], prevValue, newValue)) != prevValue);
This code works fine for me, but still, do I need to declare the SSBO (or Image) with coherent qualifier in this case?
And do I need to use coherent in a case when I call only atomicAdd?
When exactly do I need to use coherent qualifier? Do I need to use it only in case of direct writing: ssbo[index] = value;?

TL;DR
I found evidence that supports both answers regarding coherent.
Current score:
Requiring coherent with atomics: 1.5
Omitting coherent with atomics: 5.75
Bottom line, still not sure despite the score. Inside a single workgroup, I'm mostly convinced coherent is not required in practice. I'm not so sure in these cases:
more than 1 workgroup in glDispatchCompute
multiple glDispatchCompute calls that all access the same memory location (atomically) without any glMemoryBarrier between them
However, is there a performance cost to declaring SSBOs (or individual struct members) coherent when you only access them through atomic operations? Based on what is below, I don't believe there is because coherent adds "visibility" instructions or instruction flags at the variable's read or write operations. If a variable is only accessed through atomic operations, the compiler should hopefully:
ignore coherent when generating the atomic instructions because it has no effect
use the appropriate mechanism to make sure the result of the atomic operation is visible outside the shader invocation, warp, workgroup or rendering command.
From the OpenGL wiki's "Memory Model" page:
Note that atomic counters are different functionally from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like. (removed on 2020-04-12)
However, if memory has been modified in an incoherent fashion, any subsequent reads from that memory are not automatically guaranteed to see these changes.
+1 for requiring coherent
The code from Intel's article "OpenGL Performance Tips: Atomic Counter Buffers versus Shader Storage Buffer Objects"
// Fragment shader used for ACB gets output color from a texture
#version 430 core
uniform sampler2D texUnit;
layout(binding = 0) uniform atomic_uint acb[ s(nCounters) ];
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
void main()
{
for (int i=0; i< s(nCounters) ; ++i) atomicCounterIncrement(acb[i]);
fragColor = texture(texUnit, texcoord);
}
// Fragment shader used for SSBO gets output color from a texture
#version 430 core
uniform sampler2D texUnit;
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
layout(std430, binding = 0) buffer ssbo_data
{
uint v[ s(nCounters) ];
};
void main()
{
for (int i=0; i< s(nCounters) ; ++i) atomicAdd(v[i], 1);
fragColor = texture(texUnit, texcoord);
}
Notice that ssbo_data in the second shader is not declared coherent.
The article also states:
The OpenGL foundation recommends using [atomic counter buffers] over SSBOs for various reasons; however improved performance is not one of them. This is because ACBs are internally implemented as SSBO atomic operations; therefore there are no real performance benefits from utilizing ACBs.
So atomic counters are apparently just SSBO atomic operations under the hood. (But what are those "various reasons" and where are those recommendations? Is Intel hinting at a conspiracy in favor of atomic counters...?)
+1 for omitting coherent
GLSL specification
The GLSL spec uses different wording when describing coherent and atomic operations (emphasis mine):
(4.10) When accessing memory using variables not declared as coherent, the memory accessed by a shader may be cached by the implementation to service future accesses to the same address. Memory stores may be cached in such a way that the values written might not be visible to other shader invocations accessing the same memory. The implementation may cache the values fetched by memory reads and return the same values to any shader invocation accessing the same memory, even if the underlying memory has been modified since the first memory read.
(8.11) Atomic memory functions perform atomic operations on an individual signed or unsigned integer stored in buffer-object or shared-variable storage. All of the atomic memory operations read a value from memory, compute a new value using one of the operations described below, write the new value to memory, and return the original value read. The contents of the memory being updated by the atomic operation are guaranteed not to be modified by any other assignment or atomic memory function in any shader invocation between the time the original value is read and the time the new value is written.
All the built-in functions in this section accept arguments with combinations of restrict, coherent, and volatile memory qualification, despite not having them listed in the prototypes. The atomic operation will operate as required by the calling argument’s memory qualification, not by the built-in function’s formal parameter memory qualification.
So on the one hand atomic operations are supposed to work directly with the storage's memory (does that imply bypassing possible caches?). On the other hand, it seems that memory qualifications (e.g. coherent) play a role in what the atomic operation does.
+0.5 for requiring coherent
OpenGL specification
The OpenGL 4.6 spec sheds more light on this issue in section 7.13.1 "Shader Memory Access Ordering"
The built-in atomic memory transaction and atomic counter functions may be used to read and write a given memory address atomically. While built-in atomic functions issued by multiple shader invocations are executed in undefined order relative to each other, these functions perform both a read and a write of a memory address and guarantee that no other memory transaction will write to the underlying memory between the read and write. Atomics allow shaders to use shared global addresses for mutual exclusion or as counters, among other uses.
The intent of atomic operations then clearly seems to be, well, atomic all the time and not depending on a coherent qualifier. Indeed, why would one want an atomic operation that isn't somehow combined between different shader invocations? Incrementing a locally cached value from multiple invocations and having all of them eventually write a completely independent value makes no sense.
+1 for omitting coherent
OpenGL spec issue #14
OpenGL 4.6: Do atomic counter buffers require the use of glMemoryBarrier calls to be able to access the counter?
We discussed this again in the OpenGL|ES meeting. Based on feedback from IHVs and their implementation of atomic counters we're planning to treat them like we treat other resources like image atomic, image load/store, buffer variables, etc. in that they require explicit synchronization from the application. The spec will be changed to add "atomic counters" to the places where the other resources are enumerated.
The described spec change occurred between OpenGL 4.5 and 4.6, but it relates to glMemoryBarrier, which plays no part inside a single glDispatchCompute.
no effect
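For reference, the kind of explicit application-side synchronization the spec issue refers to would look roughly like this between two compute dispatches that touch the same SSBO (a sketch; the compute program and buffer setup are assumed to already exist):
// Sketch: two dispatches atomically updating the same SSBO.
// 'program' and 'ssbo' are assumed to be created elsewhere.
glUseProgram(program);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);

glDispatchCompute(64, 1, 1);
// Make the first dispatch's SSBO writes visible to shader accesses
// issued by the second dispatch.
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
glDispatchCompute(64, 1, 1);
Whether the barrier is strictly needed when the memory is only ever touched through atomics is exactly what the rest of this post tries to pin down.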
Example Shader
Let's inspect the assembly produced by two simple shaders to see what happens in practice.
#version 460
layout(local_size_x = 512) in;
// Non-coherent qualified SSBO
layout(binding=0) restrict buffer Buf { uint count; } buf;
// Coherent qualified SSBO
layout(binding=1) coherent restrict buffer Buf_coherent { uint count; } buf_coherent;
void main()
{
// First shader with atomics (v1)
uint read_value1 = atomicAdd(buf.count, 2);
uint read_value2 = atomicAdd(buf_coherent.count, 4);
// Second shader with non-atomic add (v2)
buf.count += 2;
buf_coherent.count += 4;
}
The second shader is used to compare the effects of the coherent qualifier between atomic operations and non-atomic operations.
AMD
AMD publishes Instruction Set Architecture (ISA) documents which, coupled with the Radeon GPU Analyzer, give insight into how the GPUs actually implement this.
Shader v1 (Vega gfx900)
s_getpc_b64 s[0:1] BE801C80
s_mov_b32 s0, s2 BE800002
s_mov_b64 s[2:3], exec BE82017E
s_ff1_i32_b64 s4, exec BE84117E
s_lshl_b64 s[4:5], 1, s4 8E840481
s_and_b64 s[4:5], s[4:5], exec 86847E04
s_and_saveexec_b64 s[4:5], s[4:5] BE842004
s_cbranch_execz label_0010 BF880008
s_load_dwordx4 s[8:11], s[0:1], 0x00 C00A0200 00000000
s_bcnt1_i32_b64 s2, s[2:3] BE820D02
s_mulk_i32 s2, 0x0002 B7820002
v_mov_b32 v0, s2 7E000202
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
label_0010:
s_mov_b64 exec, s[4:5] BEFE0104
s_mov_b64 s[2:3], exec BE82017E
s_ff1_i32_b64 s4, exec BE84117E
s_lshl_b64 s[4:5], 1, s4 8E840481
s_and_b64 s[4:5], s[4:5], exec 86847E04
s_and_saveexec_b64 s[4:5], s[4:5] BE842004
s_cbranch_execz label_001F BF880008
s_load_dwordx4 s[8:11], s[0:1], 0x20 C00A0200 00000020
s_bcnt1_i32_b64 s0, s[2:3] BE800D02
s_mulk_i32 s0, 0x0004 B7800004
v_mov_b32 v0, s0 7E000200
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
label_001F:
s_endpgm BF810000
(The exec-mask manipulation and branching appear to be a wavefront-level optimization: the active lanes are counted with s_bcnt1_i32_b64 and multiplied by the addend with s_mulk_i32, and a single lane then issues one combined buffer_atomic_add for the whole wavefront.)
We can see that both atomic operations (on coherent and non-coherent buffers) result in the same instruction on all supported architectures of the Radeon GPU Analyzer:
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
Decoding this instruction shows that the GLC (Globally Coherent) flag is set to 0 which means for atomic operations: "Previous data value is not returned. No L1 persistence across wavefronts". Modifying the shader to use the returned values changes the GLC flag of both atomic instructions to 1 which means: "Previous data value is returned. No L1 persistence across wavefronts".
The documents dating from 2013 (Sea Islands, etc.) have an interesting description of the BUFFER_ATOMIC_<op> instructions:
Buffer object atomic operation. Always globally coherent.
So on AMD hardware, it appears coherent has no effect for atomic operations.
Shader v2 (Vega gfx900)
s_getpc_b64 s[0:1] BE801C80
s_mov_b32 s0, s2 BE800002
s_load_dwordx4 s[4:7], s[0:1], 0x00 C00A0100 00000000
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_load_dword v0, v0, s[4:7], 0 E0500000 80010000
s_load_dwordx4 s[0:3], s[0:1], 0x20 C00A0000 00000020
s_waitcnt vmcnt(0) BF8C0F70
v_add_u32 v0, 2, v0 68000082
buffer_store_dword v0, v0, s[4:7], 0 glc E0704000 80010000
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_load_dword v0, v0, s[0:3], 0 glc E0504000 80000000
s_waitcnt vmcnt(0) BF8C0F70
v_add_u32 v0, 4, v0 68000084
buffer_store_dword v0, v0, s[0:3], 0 glc E0704000 80000000
s_endpgm BF810000
The buffer_load_dword operation on the coherent buffer uses the glc flag and the other one does not as expected.
On AMD: +1 for omitting coherent
NVIDIA
It's possible to get the assembly of a shader by inspecting the blob returned by glGetProgramBinary(). The instructions are described in NV_gpu_program4, NV_gpu_program5 and NV_gpu_program5_mem_extended.
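For instance, a rough sketch of how the blob can be dumped for inspection (assuming 'program' is a linked program object and a suitable GL context is current):
#include <fstream>
#include <vector>

// Query the binary size, fetch the blob and write it to a file; the
// NVIDIA "assembly" is the readable text between !!NV... and END.
GLint length = 0;
glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);

std::vector<char> blob(length);
GLenum binaryFormat = 0;
glGetProgramBinary(program, length, nullptr, &binaryFormat, blob.data());

std::ofstream("program.bin", std::ios::binary).write(blob.data(), blob.size());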
Shader v1
!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
ATOMB.ADD.U32 R0.x, {2, 0, 0, 0}, sbo_buf0[0];
ATOMB.ADD.U32 R0.x, {4, 0, 0, 0}, sbo_buf1[0];
END
There is no difference whether coherent is present or not.
Shader v2
!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
LDB.U32 R0.x, sbo_buf0[0];
ADD.U R0.x, R0, {2, 0, 0, 0};
STB.U32 R0, sbo_buf0[0];
LDB.U32.COH R0.x, sbo_buf1[0];
ADD.U R0.x, R0, {4, 0, 0, 0};
STB.U32 R0, sbo_buf1[0];
END
The LDB.U32 operation on the coherent buffer uses the COH modifier which means "Make LOAD and STORE operations use coherent caching".
On NVIDIA: +1 for omitting coherent
SPIR-V (with Vulkan target)
Let's see what SPIR-V code is generated by the glslang SPIR-V generator.
Shader v1
// Generated with glslangValidator.exe -H --target-env vulkan1.1
// Module Version 10300
// Generated by (magic number): 80008
// Id's are bound by 30
Capability Shader
1: ExtInstImport "GLSL.std.450"
MemoryModel Logical GLSL450
EntryPoint GLCompute 4 "main"
ExecutionMode 4 LocalSize 512 1 1
Source GLSL 460
Name 4 "main"
Name 8 "read_value1"
Name 9 "Buf"
MemberName 9(Buf) 0 "count"
Name 11 "buf"
Name 20 "read_value2"
Name 21 "Buf_coherent"
MemberName 21(Buf_coherent) 0 "count"
Name 23 "buf_coherent"
MemberDecorate 9(Buf) 0 Restrict
MemberDecorate 9(Buf) 0 Offset 0
Decorate 9(Buf) Block
Decorate 11(buf) DescriptorSet 0
Decorate 11(buf) Binding 0
MemberDecorate 21(Buf_coherent) 0 Coherent
MemberDecorate 21(Buf_coherent) 0 Restrict
MemberDecorate 21(Buf_coherent) 0 Offset 0
Decorate 21(Buf_coherent) Block
Decorate 23(buf_coherent) DescriptorSet 0
Decorate 23(buf_coherent) Binding 1
Decorate 29 BuiltIn WorkgroupSize
2: TypeVoid
3: TypeFunction 2
6: TypeInt 32 0
7: TypePointer Function 6(int)
9(Buf): TypeStruct 6(int)
10: TypePointer StorageBuffer 9(Buf)
11(buf): 10(ptr) Variable StorageBuffer
12: TypeInt 32 1
13: 12(int) Constant 0
14: TypePointer StorageBuffer 6(int)
16: 6(int) Constant 2
17: 6(int) Constant 1
18: 6(int) Constant 0
21(Buf_coherent): TypeStruct 6(int)
22: TypePointer StorageBuffer 21(Buf_coherent)
23(buf_coherent): 22(ptr) Variable StorageBuffer
25: 6(int) Constant 4
27: TypeVector 6(int) 3
28: 6(int) Constant 512
29: 27(ivec3) ConstantComposite 28 17 17
4(main): 2 Function None 3
5: Label
8(read_value1): 7(ptr) Variable Function
20(read_value2): 7(ptr) Variable Function
15: 14(ptr) AccessChain 11(buf) 13
19: 6(int) AtomicIAdd 15 17 18 16
Store 8(read_value1) 19
24: 14(ptr) AccessChain 23(buf_coherent) 13
26: 6(int) AtomicIAdd 24 17 18 25
Store 20(read_value2) 26
Return
FunctionEnd
The only difference between buf and buf_coherent is the decoration of the latter with MemberDecorate 21(Buf_coherent) 0 Coherent. Their usage afterwards is identical.
Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:
Capability Shader
+ Capability VulkanMemoryModelKHR
+ Extension "SPV_KHR_vulkan_memory_model"
1: ExtInstImport "GLSL.std.450"
- MemoryModel Logical GLSL450
+ MemoryModel Logical VulkanKHR
EntryPoint GLCompute 4 "main"
Decorate 11(buf) Binding 0
- MemberDecorate 21(Buf_coherent) 0 Coherent
MemberDecorate 21(Buf_coherent) 0 Restrict
which means... I don't quite know because I'm not versed in Vulkan's intricacies. I did find this informative section of the "Memory Model" appendix in the Vulkan 1.2 spec:
While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations. GLSL to SPIR-V compilers should map all (non-atomic) operations on a coherent variable to Make{Pointer,Texel}{Available}{Visible} flags in this model.
Atomic operations implicitly have availability/visibility operations, and the scope of those operations is taken from the atomic operation’s scope.
Shader v2
(skipping full output)
The only difference between buf and buf_coherent is again MemberDecorate 18(Buf_coherent) 0 Coherent.
Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:
- MemberDecorate 18(Buf_coherent) 0 Coherent
- 23: 6(int) Load 22
- 24: 6(int) IAdd 23 21
- 25: 13(ptr) AccessChain 20(buf_coherent) 11
- Store 25 24
+ 23: 6(int) Load 22 MakePointerVisibleKHR NonPrivatePointerKHR 24
+ 25: 6(int) IAdd 23 21
+ 26: 13(ptr) AccessChain 20(buf_coherent) 11
+ Store 26 25 MakePointerAvailableKHR NonPrivatePointerKHR 24
Notice the addition of MakePointerVisibleKHR and MakePointerAvailableKHR that control operation coherency at the instruction level instead of the variable level.
+1 for omitting coherent (maybe?)
CUDA
The Parallel Thread Execution ISA section of the CUDA Toolkit documentation has this information:
8.5. Scope
Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are three scopes:
Table 18. Scopes
.cta: The set of all threads executing in the same CTA as the current thread.
.gpu: The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.
.sys: The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.
Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.
Regarding CTA:
A cooperative thread array (CTA) is a set of concurrent threads that execute the same kernel program. A grid is a set of CTAs that execute independently.
So in GLSL terms, CTA == work group and grid == glDispatchCompute call.
The atom instruction description:
9.7.12.4. Parallel Synchronization and Communication Instructions: atom
Atomic reduction operations for thread-to-thread communication.
[...]
The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model.
[...]
If no scope is specified, the atomic operation is performed with .gpu scope.
So by default, all shader invocations of a glDispatchCompute would see the result of an atomic operation... unless the GLSL compiler generates something that uses the cta scope, in which case it would only be visible inside the work group. That latter scope, however, corresponds to shared GLSL variables, so perhaps it's only used for those and not for SSBO operations. NVIDIA isn't very open about this process, so I haven't found a way to tell for sure (perhaps with glGetProgramBinary). However, since the semantics of cta map to a work group and gpu to buffer resources (i.e. SSBOs, images, etc.), I declare:
+0.5 for omitting coherent
Empirical evidence
I have written a particle system compute shader that uses an SSBO-backed variable as an operand to atomicAdd() and it works. Usage of coherent was not necessary even with a work group size of 512. However, there was never more than 1 work group. This was tested mainly on an NVIDIA GTX 1080, so, as seen above, atomic operations on NVIDIA seem to always be at least visible inside the work group.
+0.25 for omitting coherent

Related

Aliasing a SSBO by binding it multiple times in the same shader

Playing around with bindless rendering, I have one big static SSBO that holds my vertex data. The vertices are packed in memory as a contiguous array where each vertex has the following layout:
| Position (floats)                              | Normal (snorm shorts) | Pad   |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|      P.x      |      P.y      |      P.z      |  N.x  |  N.y  |  N.z  |       |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|     float     |     float     |     float     |     uint      |     uint      |
Note how each vertex is 20 bytes / 5 "words" / 1.25 vec4s. Not exactly a round number for a GPU. So instead of doing a bunch of padding and using unnecessary memory, I have opted to unpack the data "manually".
Vertex shader:
...
layout(std430, set = 0, binding = 1)
readonly buffer FloatStaticBuffer
{
float staticBufferFloats[];
};
layout(std430, set = 0, binding = 1) // Using the same binding?!
readonly buffer UintStaticBuffer
{
uint staticBufferUInts[];
};
...
void main()
{
const uint vertexBaseDataI = gl_VertexIndex * 5u;
// Unpack position
const vec3 position = vec3(
staticBufferFloats[vertexBaseDataI + 0u],
staticBufferFloats[vertexBaseDataI + 1u],
staticBufferFloats[vertexBaseDataI + 2u]);
// Unpack normal
const vec3 normal = vec3(
unpackSnorm2x16(staticBufferUInts[vertexBaseDataI + 3u]),
unpackSnorm2x16(staticBufferUInts[vertexBaseDataI + 4u]).x);
...
}
It is awfully convenient to be able to "alias" the buffer as both float and uint data.
The question: is "aliasing" a SSBO this way a terrible idea, and I'm just getting lucky, or is this actually a valid option that would work across platforms?
Alternatives:
Use just one buffer, say staticBufferUInts, and then use uintBitsToFloat to extract the positions. Not a big deal, but might have a small performance cost?
Bind the same buffer twice on the CPU to two different bindings. Again, not a big deal, just slightly annoying.
Vulkan allows incompatible resources to alias in memory as long as no malformed values are read from it. (Actually, I think it's allowed even when you read from the invalid sections - you should just get garbage. But I can't find the section of the standard right now that spells this out. The Vulkan standard is way too complicated.)
From the standard, section "Memory Aliasing":
Otherwise, the aliases interpret the contents of the memory differently, and writes via one alias make the contents of memory partially or completely undefined to the other alias. If the first alias is a host-accessible subresource, then the bytes affected are those written by the memory operations according to its addressing scheme. If the first alias is not host-accessible, then the bytes affected are those overlapped by the image subresources that were written. If the second alias is a host-accessible subresource, the affected bytes become undefined. If the second alias is not host-accessible, all sparse image blocks (for sparse partially-resident images) or all image subresources (for non-sparse image and fully resident sparse images) that overlap the affected bytes become undefined.
Note that the standard talks about bytes being written and becoming undefined in aliasing resources. It's not the entire resource that becomes invalid.
Let's see it this way: You have two aliasing SSBOs (in reality just one that's bound twice) with different element types (float and uint). Any bytes that you wrote floats into became valid in the "float view" and invalid in the "uint view" the moment you wrote into the buffer. The same goes for the uints: the bytes occupied by them have become valid in the uint view but invalid in the float view. According to the standard, this means that both views have invalid sections in them; however, neither of them is fully invalid. In particular, the sections you care about are still valid and may be read from.
In short: It's allowed.

What exact rules in the C++ memory model prevent reordering before acquire operations?

I have a question regarding the order of operations in the following code:
std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;
void thread1() {
y.exchange(1, std::memory_order_acq_rel);
r1 = x.load(std::memory_order_relaxed);
}
void thread2() {
x.exchange(1, std::memory_order_acq_rel);
r2 = y.load(std::memory_order_relaxed);
}
Given the description of std::memory_order_acquire on the cppreference page (https://en.cppreference.com/w/cpp/atomic/memory_order), that
A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load.
it seems obvious that there can never be an outcome that r1 == 0 && r2 == 0 after running thread1 and thread2 concurrently.
However, I cannot find any wording in the C++ standard (looking at the C++14 draft right now), which establishes guarantees that two relaxed loads cannot be reordered with acquire-release exchanges. What am I missing?
EDIT: As has been suggested in the comments, it is actually possible to get both r1 and r2 equal to zero. I've updated the program to use load-acquire as follows:
std::atomic<int> x;
std::atomic<int> y;
int r1;
int r2;
void thread1() {
y.exchange(1, std::memory_order_acq_rel);
r1 = x.load(std::memory_order_acquire);
}
void thread2() {
x.exchange(1, std::memory_order_acq_rel);
r2 = y.load(std::memory_order_acquire);
}
Now is it possible to get both r1 and r2 equal to 0 after concurrently executing thread1 and thread2? If not, which C++ rules prevent this?
The standard does not define the C++ memory model in terms of how operations are ordered around atomic operations with a specific ordering parameter.
Instead, for the acquire/release ordering model, it defines formal relationships such as "synchronizes-with" and "happens-before" that specify how data is synchronized between threads.
N4762, §29.4.2 - [atomics.order]
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M
and takes its value from any side effect in the release sequence headed by A.
In §6.8.2.1-9, the standard also states that if a store A synchronizes with a load B, anything sequenced before A inter-thread "happens-before" anything sequenced after B.
No "synchronizes-with" (and hence inter-thread happens-before) relationship is established in your second example (the first is even weaker) because the runtime relationships (that check the return values from the loads) are missing.
But even if you did check the return value, it would not be helpful since the exchange operations do not actually 'release' anything (i.e. no memory operations are sequenced before those operations).
Neither do the atomic load operations 'acquire' anything, since no operations are sequenced after the loads.
Therefore, according to the standard, each of the four possible outcomes for the loads in both examples (including 0 0) is valid.
In fact, the guarantees given by the standard are no stronger than memory_order_relaxed on all operations.
If you want to exclude the 0 0 result in your code, all 4 operations must use std::memory_order_seq_cst. That guarantees a single total order of the involved operations.
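For illustration, here is a hedged sketch of the seq_cst variant of the question's code that rules out the 0 0 outcome:
#include <atomic>

std::atomic<int> x{0};
std::atomic<int> y{0};
int r1, r2;

void thread1() {
    y.exchange(1, std::memory_order_seq_cst);   // participates in the single total order S
    r1 = x.load(std::memory_order_seq_cst);
}

void thread2() {
    x.exchange(1, std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}

// Whichever exchange comes first in S, the load that follows the other
// exchange must observe it, so r1 == 0 && r2 == 0 is impossible.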
You already have an answer to the language-lawyer part of this. But I want to answer the related question of how to understand why this can be possible in asm on a possible CPU architecture that uses LL/SC for RMW atomics.
It doesn't make sense for C++11 to forbid this reordering: it would require a store-load barrier in this case where some CPU architectures could avoid one.
It might actually be possible with real compilers on PowerPC, given the way they map C++11 memory-orders to asm instructions.
On PowerPC64, a function with an acq_rel exchange and an acquire load (using pointer args instead of static variables) compiles as follows with gcc6.3 -O3 -mregnames. This is from a C11 version because I wanted to look at clang output for MIPS and SPARC, and Godbolt's clang setup works for C11 <atomic.h> but fails for C++11 <atomic> when you use -target sparc64.
#include <stdatomic.h> // This is C11, not C++11, for Godbolt reasons
long foo(_Atomic long *a, _Atomic int *b) {
atomic_exchange_explicit(b, 1, memory_order_acq_rel);
//++*a;
return atomic_load_explicit(a, memory_order_acquire);
}
(source + asm on Godbolt for MIPS32R6, SPARC64, ARM 32, and PowerPC64.)
foo:
lwsync # with seq_cst exchange this is full sync, not just lwsync
# gone if we use exchange with mo_acquire or relaxed
# so this barrier is providing release-store ordering
li %r9,1
.L2:
lwarx %r10,0,%r4 # load-linked from 0(%r4)
stwcx. %r9,0,%r4 # store-conditional 0(%r4)
bne %cr0,.L2 # retry if SC failed
isync # missing if we use exchange(1, mo_release) or relaxed
ld %r3,0(%r3) # 64-bit load double-word of *a
cmpw %cr7,%r3,%r3 # compare the loaded value with itself (always equal)
bne- %cr7,$+4 # never-taken branch; this cmp/bne/isync sequence is the standard PowerPC load-acquire idiom
isync # make the *a load a load-acquire
blr
isync is not a store-load barrier; it only requires the preceding instructions to complete locally (retire from the out-of-order part of the core). It doesn't wait for the store buffer to be flushed so other threads can see the earlier stores.
Thus the SC (stwcx.) store that's part of the exchange can sit in the store buffer and become globally visible after the pure acquire-load that follows it. In fact, another Q&A already asked this, and the answer is that we think this reordering is possible. Does `isync` prevent Store-Load reordering on CPU PowerPC?
If the pure load is seq_cst, PowerPC64 gcc puts a sync before the ld. Making the exchange seq_cst does not prevent the reordering. Remember that C++11 only guarantees a single total order for SC operations, so the exchange and the load both need to be SC for C++11 to guarantee it.
So PowerPC has a bit of an unusual mapping from C++11 to asm for atomics. Most systems put the heavier barriers on stores, allowing seq-cst loads to be cheaper or only have a barrier on one side. I'm not sure if this was required for PowerPC's famously-weak memory ordering, or if another choice was possible.
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows some possible implementations on various architectures. It mentions multiple alternatives for ARM.
On AArch64, we get this for the question's original C++ version of thread1:
thread1():
adrp x0, .LANCHOR0
mov w1, 1
add x0, x0, :lo12:.LANCHOR0
.L2:
ldaxr w2, [x0] # load-linked with acquire semantics
stlxr w3, w1, [x0] # store-conditional with sc-release semantics
cbnz w3, .L2 # retry until exchange succeeds
add x1, x0, 8 # the compiler noticed the variables were next to each other
ldar w1, [x1] # load-acquire
str w1, [x0, 12] # r1 = load result
ret
The reordering can't happen there because AArch64 acquire loads interact with release stores to give sequential consistency, not just plain acq/rel. Release stores can't reorder with later acquire loads.
(They can reorder with later plain loads, on paper and probably in some real hardware. AArch64 seq_cst can be cheaper than on other ISAs, if you avoid acquire loads right after release stores.
But unfortunately it makes acq/rel worse than x86. This is fixed with ARMv8.3-A LDAPR, a load that's just acquire not sequential-acquire. It allows earlier stores, even STLR, to reorder with it. So you get just acq_rel, allowing StoreLoad reordering but not other reordering. (It's also an optional feature in ARMv8.2-A).)
On a machine that also or instead had plain-release LL/SC atomics, it's easy to see that an acq_rel doesn't stop later loads to different cache lines from becoming globally visible after the LL but before the SC of the exchange.
If exchange is implemented with a single transaction like on x86, so the load and store are adjacent in the global order of memory operations, then certainly no later operations can be reordered with an acq_rel exchange and it's basically equivalent to seq_cst.
But LL/SC doesn't have to be a true atomic transaction to give RMW atomicity for that location.
In fact, a single asm swap instruction could have relaxed or acq_rel semantics. SPARC64 needs membar instructions around its swap instruction, so unlike x86's xchg it's not seq-cst on its own. (SPARC has really nice / human readable instruction mnemonics, especially compared to PowerPC. Well, basically anything is more readable than PowerPC.)
Thus it doesn't make sense for C++11 to require that it did: it would hurt an implementation on a CPU that didn't otherwise need a store-load barrier.
In release-acquire ordering, to create a synchronization point between two threads we need some atomic object M that is the same in both operations:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
or, in more detail:
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.
The synchronization is established only between the threads releasing and acquiring the same atomic variable.
N = u | if (M.load(acquire) == v) :[B]
[A]: M.store(v, release) | assert(N == u)
Here the synchronization point is on M: the store-release and the load-acquire (which takes its value from that store-release!). As a result, the store N = u in thread A (which happens before the store-release on M) is visible in thread B (N == u) after the load-acquire on the same M.
If we take the example:
atomic<int> x, y;
int r1, r2;
void thread_A() {
y.exchange(1, memory_order_acq_rel);
r1 = x.load(memory_order_acquire);
}
void thread_B() {
x.exchange(1, memory_order_acq_rel);
r2 = y.load(memory_order_acquire);
}
What can we select as the common atomic object M? Say x: x.load(memory_order_acquire) would be a synchronization point with x.exchange(1, memory_order_acq_rel) (memory_order_acq_rel includes memory_order_release, which is stronger, and exchange includes a store), but only if x.load takes the value written by x.exchange. And even then, what gets synchronized are the loads sequenced after the acquire (there is nothing after the acquire in the code) with the stores sequenced before the release (and again there is nothing in the code before the exchange).
A correct solution (see this almost identical question) can be the following:
atomic<int> x, y;
int r1, r2;
void thread_A()
{
x.exchange(1, memory_order_acq_rel); // [Ax]
r1 = y.exchange(1, memory_order_acq_rel); // [Ay]
}
void thread_B()
{
y.exchange(1, memory_order_acq_rel); // [By]
r2 = x.exchange(1, memory_order_acq_rel); // [Bx]
}
Assume that r1 == 0.
All modifications to any particular atomic variable occur in a total order that is specific to this one atomic variable.
We have two modifications of y: [Ay] and [By]. Because r1 == 0, [Ay] comes before [By] in the total modification order of y, and therefore [By] reads the value stored by [Ay]. So we have the following:
A writes to x: [Ax]
A then performs the store-release [Ay] to y (acq_rel includes release, and exchange includes a store)
B performs a load-acquire from y ([By] reads the value stored by [Ay])
Once the atomic load-acquire on y is completed, thread B is guaranteed to see everything thread A wrote to memory before the store-release on y. So it sees the side effect of [Ax], and r2 == 1.
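A small stress harness for this exchange-based version (my own sketch, not part of the original answer; it cannot prove the guarantee, but the assertion should never fire):
#include <atomic>
#include <cassert>
#include <thread>

int main() {
    for (int i = 0; i < 100000; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.exchange(1, std::memory_order_acq_rel);       // [Ax]
            r1 = y.exchange(1, std::memory_order_acq_rel);  // [Ay]
        });
        std::thread b([&] {
            y.exchange(1, std::memory_order_acq_rel);       // [By]
            r2 = x.exchange(1, std::memory_order_acq_rel);  // [Bx]
        });
        a.join();
        b.join();

        assert(!(r1 == 0 && r2 == 0));  // ruled out by the reasoning above
    }
}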
Another possible solution uses atomic_thread_fence:
atomic<int> x, y;
int r1, r2;
void thread_A()
{
x.store(1, memory_order_relaxed); // [A1]
atomic_thread_fence(memory_order_acq_rel); // [A2]
r1 = y.exchange(1, memory_order_relaxed); // [A3]
}
void thread_B()
{
y.store(1, memory_order_relaxed); // [B1]
atomic_thread_fence(memory_order_acq_rel); // [B2]
r2 = x.exchange(1, memory_order_relaxed); // [B3]
}
Again, because all modifications of the atomic variable y occur in a total order, either [A3] comes before [B1] or vice versa.
If [B1] is before [A3], then [A3] reads the value stored by [B1] and r1 == 1.
If [A3] is before [B1], then [B1] reads the value stored by [A3],
and from Fence-fence synchronization:
A release fence [A2] in thread A synchronizes-with an acquire fence [B2] in thread B, if:
There exists an atomic object y,
There exists an atomic write [A3] (with any memory order) that modifies y in thread A,
[A2] is sequenced-before [A3] in thread A,
There exists an atomic read [B1] (with any memory order) in thread B,
[B1] reads the value written by [A3],
[B1] is sequenced-before [B2] in thread B.
In this case, all stores ([A1]) that are sequenced-before [A2] in thread A will happen-before all loads ([B3]) from the same locations (x) made in thread B after [B2].
So [A1] (the store of 1 to x) happens before and is a visible side effect for [B3] (the load from x whose result is saved to r2). Therefore 1 is loaded from x and r2 == 1.
[A1]: x = 1 | if (y.load(relaxed) == 1) :[B1]
[A2]: ### release ### | ### acquire ### :[B2]
[A3]: y.store(1, relaxed) | assert(x == 1) :[B3]
As language-lawyer reasoning is hard to follow, I thought I'd add how a programmer who understands atomics would reason about the second snippet in your question:
Since this is symmetrical code, it is enough to look at just one side.
Since the question is about the value of r1 (r2), we start with looking at
r1 = x.load(std::memory_order_acquire);
Depending on what the value of r1 is, we can say something about the visibility of other values. However, since the value of r1 isn't tested - the acquire is irrelevant.
In either case, the value of r1 can be any value that was ever written to it (in the past or future *)). Therefore it can be zero. Nevertheless, we can assume it to BE zero because we're interested in whether or not the outcome of the whole program can be 0 0, which amounts to a kind of test of the value of r1.
Hence, assuming we read zero, we can say that if that zero was written by another thread with memory_order_release, then every other write to memory done by that thread before the store-release will also be visible to this thread. However, the zero that we read is the initialization value of x, and initialization values are non-atomic - let alone a 'release' - and certainly there wasn't anything "ordered" in front of them in terms of writing that value to memory; so there is nothing we can say about the visibility of other memory locations. In other words, again, the 'acquire' is irrelevant.
So, we can get r1 = 0 and the fact that we used acquire is irrelevant. The same reasoning then holds of r2. So the result can be r1 = r2 = 0.
In fact, if you assume the value of r1 is 1 after the load-acquire, and that that 1 was written by thread2 with memory order release (which MUST be the case, since that is the only place where a value of 1 is ever written to x), then all we know is that everything written to memory by thread2 before that store-release will also be visible to thread1 (provided thread1 did read x == 1!). But thread2 doesn't write ANYTHING before writing to x, so again the whole release-acquire relationship is irrelevant, even in the case of loading a value of 1.
*) However, it is possible with further reasoning to show that certain value can never occur because of inconsistency with the memory model - but that doesn't happen here.
In the original version, it is possible to see r1 == 0 && r2 == 0 because there is no requirement that the stores propagate to the other thread before it performs its read. This is not a re-ordering of either thread's operations, but e.g. a read of a stale cache.
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 0; | y == 0;
y.exchange(1, std::memory_order_acq_rel); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2
The release on Thread 1 is ignored by Thread 2, and vice versa. In the abstract machine, there is no requirement that the values of x and y be consistent between the threads.
Thread 1's cache | Thread 2's cache
x == 0; // stale | x == 1;
y == 1; | y == 0; // stale
r1 = x.load(std::memory_order_relaxed); // Thread 1
r2 = y.load(std::memory_order_relaxed); // Thread 2
You need more threads to get "violations of causality" with acquire / release pairs, as the normal ordering rules, combined with the "becomes visible side effect in" rules force at least one of the loads to see 1.
Without loss of generality, let's assume that Thread 1 executes first.
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 0; | y == 0;
y.exchange(1, std::memory_order_acq_rel); // Thread 1
Thread 1's cache | Thread 2's cache
x == 0; | x == 0;
y == 1; | y == 1; // sync
The release on Thread 1 forms a pair with the acquire on Thread 2, and the abstract machine describes a consistent y on both threads
r1 = x.load(std::memory_order_relaxed); // Thread 1
x.exchange(1, std::memory_order_acq_rel); // Thread 2
r2 = y.load(std::memory_order_relaxed); // Thread 2
Let me try to explain it in other words.
Imagine that each thread runs on a different CPU core simultaneously: thread1 runs on core A and thread2 runs on core B.
Core B cannot know the REAL execution order on core A. The meaning of the memory order is only about which results core A has to make visible to core B.
std::atomic<int> x, y;
int r1, r2, var1, var2;
void thread1() { //Core A
var1 = 99; //(0)
y.exchange(1, std::memory_order_acq_rel); //(1)
r1 = x.load(std::memory_order_acquire); //(2)
}
void thread2() { //Core B
var2 = 999; //(2.5)
x.exchange(1, std::memory_order_acq_rel); //(3)
r2 = y.load(std::memory_order_acquire); //(4)
}
For example, (4) is just a REQUEST regarding (1) (the code that tags variable y with memory_order_release).
And (4) on core B asks core A for a specific order: (0) -> (1) -> (4).
Different REQUESTs may see different sequences of the other thread's operations.
(If we now had a core C interacting with core A through some atomic variable, core C might see a different result than core B.)
OK, now a detailed step-by-step explanation (for the code above):
We start on core B: (2.5)
(2.5) var2 = 999;
(3) acquire part: look for a release store to variable 'x'; there is none yet. At this point any order on core A, [(0),(1),(2)] or [(0),(2),(1)], is still legal, so nothing stops us (B) from reordering (3) and (4).
(3) release part: look for an acquire load of variable 'x'; (2) is found, so present an ordered list to core A: [var2 = 999, x.exchange(1)].
(4) look for a release store to variable y; it is found at (1). So now, standing on core B, we can see the order that core A shows to us: 'var1 = 99 must come before y.exchange(1)'.
The idea is: we can see that var1 = 99 comes before y.exchange(1) because we made a REQUEST to the other core, and core A responded with that result (the REQUEST being y.load(std::memory_order_acquire)). Another core that also wants to observe core A's operations cannot necessarily reach the same conclusion.
We can never know the real execution order of (0), (1), (2).
The order that A observes for itself always gives the correct result (as if it were single-threaded).
A REQUEST from B does not affect the real execution order on A either.
The same applies to B's own operations (2.5), (3), (4).
That is, each core really performs its operations but does not necessarily tell the other cores about them, so the 'local cache' of the other cores might be stale.
Hence there is a chance of getting (0, 0) with the code in the question.

Is it allowed to use a VkBool32 as a push constant?

I am trying to create a VkBool32 in my C++ code:
VkBool32 myBool = VK_FALSE;
and push it to GLSL via a push constant:
vkCmdPushConstants(..., sizeof(myBool), &myBool);
which is received by a bool inside a uniform storage class:
layout(push_constant) uniform PushConstants
{
bool myBool;
} pushConts;
First tests seem to work and have the intended behaviour. But is this permitted by the Vulkan Spec?
Using bools for push constants is fine. There is nothing in the specs that prohibits this, and I've been using it in a few examples too.
If you take a look at the human-readable SPIR-V output you'll see that they're converted to 32 bit integers and thus are aligned to 32 bit:
GLSL
layout (push_constant) uniform PushConsts {
bool calculateNormals;
} pushConsts;
SPIR-V
430(PushConsts): TypeStruct 40(int)
431: TypePointer PushConstant 430(PushConsts)
432(pushConsts): 431(ptr) Variable PushConstant
433: TypePointer PushConstant 40(int)
So if you were to pass, e.g., a struct containing multiple booleans, you'd have to properly align (pad) it on the CPU side before passing it as a push constant.
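On the C++ side, that alignment work is trivial because VkBool32 is already a 32-bit unsigned integer; a sketch (the second member is a hypothetical extra flag, not taken from the example above):
#include <vulkan/vulkan.h>

// Mirrors a push-constant block with two GLSL bools, each a 32-bit int in SPIR-V.
struct PushConsts {
    VkBool32 calculateNormals;  // offset 0
    VkBool32 useTexture;        // offset 4 (hypothetical second flag)
};
static_assert(sizeof(PushConsts) == 8, "each GLSL bool occupies 4 bytes");

// Usage (command buffer and pipeline layout assumed to exist):
// PushConsts pc{VK_TRUE, VK_FALSE};
// vkCmdPushConstants(cmd, pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT,
//                    0, sizeof(pc), &pc);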
As for the SPIR-V side of things, the official spec is always a good starting point and also contains details on how push constants are handled and how they differ.

GL_SHADER_STORAGE_BUFFER memory limitations

I'm writing a ray tracer using OpenGL compute shaders, and I use buffers to pass data to and from the shaders.
When the size of the vec2 output buffer (which is equal to the number of rays multiplied by the number of faces) reaches ~30 MB, the attempt to map the buffer consistently returns a NULL pointer. Range mapping also fails.
I can't find any info about GL_SHADER_STORAGE_BUFFER limitations in the OpenGL documentation, but maybe someone can help me: is ~30 MB a limit, or can this mapping failure happen because of something different?
And is there any way to avoid this except for calling shader multiple times?
Data declaration in shader:
#version 440
layout(std430, binding=0) buffer rays{
vec4 r[];
};
layout(std430, binding=1) buffer faces{
vec4 f[];
};
layout(std430, binding=2) buffer outputs{
vec2 o[];
};
uniform int face_count;
uniform vec4 origin;
Calling code (using some Qt5 wrappers):
QOpenGLBuffer ray_buffer;
QOpenGLBuffer face_buffer;
QOpenGLBuffer output_buffer;
QVector<QVector2D> output;
output.resize(rays.size()*faces.size());
if(!ray_buffer.create()) { /*...*/ }
if(!ray_buffer.bind()) { /*...*/ }
ray_buffer.allocate(rays.data(), rays.size()*sizeof(QVector4D));
if(!face_buffer.create()) { /*...*/ }
if(!face_buffer.bind()) { /*...*/ }
face_buffer.allocate(faces.data(), faces.size()*sizeof(QVector4D));
if(!output_buffer.create()) { /*...*/ }
if(!output_buffer.bind()) { /*...*/ }
output_buffer.allocate(output.size()*sizeof(QVector2D));
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ray_buffer.bufferId());
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, face_buffer.bufferId());
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, output_buffer.bufferId());
int face_count = faces.size();
compute.setUniformValue("face_count", face_count);
compute.setUniformValue("origin", pos);
ogl->glDispatchCompute(rays.size()/256, faces.size(), 1);
ray_buffer.destroy();
face_buffer.destroy();
QVector2D* data = (QVector2D*)output_buffer.map(QOpenGLBuffer::ReadOnly);
First of all, you have to understand that the OpenGL specification defines minimum maxima for a variety of values (the ones starting with a MAX_* prefix). That means that implementations are required to at least provide the specified amount as the maximum value, but are free to increase the limit as implementors see fit. This way, developers can at least rely on some upper bound, but can still make provisions for possibly larger values.
Section 23 - State Tables summarizes what has been previously specified in the corresponding sections. The information you were looking for is found in table 23.64 - Implementation Dependent Aggregate Shader Limits (cont.). If you want to know about which state belongs where (because there is per-object state, quasi-global state, program state and so on), you go to section 23.
The minimum maximum size of a shader storage buffer is represented by the symbolic constant MAX_SHADER_STORAGE_BLOCK_SIZE as per section 7.8 of the core OpenGL 4.5 specification.
Since their adoption into core, the required size (i.e. the minimum maximum) has been significantly increased. In core OpenGL 4.3 and 4.4, the minimum maximum was pow(2, 24) (or 16MB with 1 byte basic machine units and 1MB = 1024^2 bytes) - in core OpenGL 4.5 this value is now pow(2, 27) (or 128MB)
Summary: When in doubt about OpenGL state, refer to section 23 of the core specification.
From OpenGL Wiki:
SSBOs can be much larger. The OpenGL spec guarantees that UBOs can be up to 16KB in size (implementations can allow them to be bigger). The spec guarantees that SSBOs can be up to 128MB. Most implementations will let you allocate a size up to the limit of GPU memory.
OpenGL < 4.5 guarantees only 16MiB (OpenGL 4.5 increased the minimum to 128MiB) , you can try using glGet() to query if you can bind more.
GLint64 max;
glGetInteger64v(GL_MAX_SHADER_STORAGE_BLOCK_SIZE, &max);
In fact, the problem seems to be in the Qt wrappers. I didn't look in depth, but when I changed QOpenGLBuffer's create(), bind(), allocate() and map() to glCreateBuffers(), glBindBuffer(), glNamedBufferData() and glMapNamedBuffer(), all called through QOpenGLFunctions_4_5_Core, the memory problem was gone until I reached 2 GB (which is the GPU's physical memory limit).
A second error I had made was not using glMemoryBarrier(), though adding it didn't help while QOpenGLBuffer was in use.
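For reference, what I believe is the relevant barrier before the readback (a sketch using the names from the question's code; GL_BUFFER_UPDATE_BARRIER_BIT is intended to cover subsequent buffer mapping and glGetBufferSubData):
ogl->glDispatchCompute(rays.size() / 256, faces.size(), 1);

// Make the compute shader's SSBO writes visible to the buffer mapping below.
ogl->glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

QVector2D* data = (QVector2D*)output_buffer.map(QOpenGLBuffer::ReadOnly);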

Issue with glBindBufferRange() OpenGL 3.1

My vertex shader is:
uniform Block1{ vec4 offset_x1; vec4 offset_x2;}block1;
out float value;
in vec4 position;
void main()
{
value = block1.offset_x1.x + block1.offset_x2.x;
gl_Position = position;
}
The code I am using to pass values is :
GLfloat color_values[8];// contains valid values
glGenBuffers(1,&buffer_object);
glBindBuffer(GL_UNIFORM_BUFFER,buffer_object);
glBufferData(GL_UNIFORM_BUFFER,sizeof(color_values),color_values,GL_STATIC_DRAW);
glUniformBlockBinding(psId,blockIndex,0);
glBindBufferRange(GL_UNIFORM_BUFFER,0,buffer_object,0,16);
glBindBufferRange(GL_UNIFORM_BUFFER,0,buffer_object,16,16);
What I am expecting here is to pass 16 bytes for each vec4 uniform. I get a GL_INVALID_VALUE error for offset = 16, size = 16.
I am confused about the offset value. The spec says it corresponds to "buffer_object".
There is an alignment restriction for UBOs when binding. Any glBindBufferRange/Base's offset must be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. This alignment could be anything, so you have to query it before building your array of uniform buffers. That means you can't do it directly in compile-time C++ logic; it has to be runtime logic.
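A sketch of that runtime query and of rounding an offset up to it (variable names taken from the question's code; this only addresses the alignment part, the block-size problem is discussed next):
// Query once; every glBindBufferRange offset for GL_UNIFORM_BUFFER
// must be a multiple of this value (commonly 16, 64 or 256).
GLint uboAlign = 1;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &uboAlign);

// Round a desired byte offset up to the next valid binding offset.
GLintptr desired = 16;
GLintptr aligned = (desired + uboAlign - 1) / uboAlign * uboAlign;
glBindBufferRange(GL_UNIFORM_BUFFER, 0, buffer_object, aligned, 16);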
Speaking of querying things at runtime, your code is horribly broken in many other ways. You did not define a layout qualifier for your uniform block; therefore, the default is used: shared. And you cannot use the shared layout without querying the layout of each block's members from OpenGL. Ever.
If you had done a query, you would have quickly discovered that your uniform block is at least 32 bytes in size, not 16. And since you only provided 16 bytes in your range, undefined behavior (which includes the possibility of program termination) results.
If you want to be able to define C/C++ objects that map exactly to the uniform block definition, you need to use std140 layout and follow the rules of std140's layout in your C/C++ object.
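For this particular block, a std140 mirror on the C++ side is straightforward, since vec4 members are 16-byte aligned (a sketch; the GLSL block would need an explicit layout(std140) qualifier):
// GLSL:  layout(std140) uniform Block1 { vec4 offset_x1; vec4 offset_x2; };
struct Block1 {
    float offset_x1[4];  // bytes  0..15
    float offset_x2[4];  // bytes 16..31
};
static_assert(sizeof(Block1) == 32, "std140: two vec4s take 32 bytes");

// Upload and bind the whole 32-byte block:
// glBufferData(GL_UNIFORM_BUFFER, sizeof(Block1), &block, GL_STATIC_DRAW);
// glBindBufferRange(GL_UNIFORM_BUFFER, 0, buffer_object, 0, sizeof(Block1));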