Sorry for the title, but I do not really know how I can name my problem.
I am reading about uniform blocks in an OpenGL book and I am a bit confused about the default std140 offsets shown there.
layout(std140) uniform TransformBlock
{
//component base alignment | offset | aligned offset
float scale; // 4 | 0 | 0
vec3 translation; // 16 | 4 | 16
float rotation[3]; // 16 | 28 | 32 (rotation[0])
// 48 (rotation[1])
// 64 (rotation[2])
mat4 projection_matrix; // 16 | 80 | 80 (column 0)
// 96 (column 1)
// 112 (column 2)
// 128 (column 3)
} transform;
I know that vec3's alignment = vec4's alignment = 16 bytes.
Scale is the first component, so its offset is 0; it is also 4 bytes, so it is clear to me that translation needs to start at, let's call it currentPosition, currentPosition + 4.
I do not understand why translation's offset's alignment is 16, though.
Also, it is unclear to me why rotation's offset is 28.
Translation is a vec3, which means 3 floats, so 3 * 4 = 12.
My first thought was that maybe we want to round it up to some natural boundary (I do not know what it would be called), but 28 is not such a value.
Same with projection_matrix's offset.
Could someone explain it to me as if I were an idiot, please?
OpenGL does not define a concept called "offset's alignment". Maybe your book is talking about some derived quantity, but since you did not name the book nor did you quote anything more than this example, I cannot say.
The quantities of worth when it comes to std140 layout are the size (how much space it takes up), base alignment, offset, and array stride (obviously only meaningful for arrays). The base alignment imposes a restriction on the offset; the offset must be divisible by the base alignment.
vec3 has a size of 12, since it contains 3 4-byte values. It has a base alignment of 16, because that's what the standard says it has:
If the member is a three-component vector with components consuming N
basic machine units, the base alignment is 4N.
The offset of a member is computed by taking the offset of the previous member, adding the previous member's size, and then rounding that value up to the member's base alignment.
So, given that scale has an offset of 0 and a size of 4, and translation has a base alignment of 16, the offset of translation is 16 (0 + 4, rounded up to the nearest multiple of 16).
The base alignment and array stride of rotation is 16, because that's what the standard says:
If the member is an array of scalars or vectors, the base alignment and array
stride are set to match the base alignment of a single array element, according
to rules(1), (2), and (3), and rounded up to the base alignment of a vec4.
Emphasis added.
So, the offset of translation is 16, and its size is 12. Add them together, and you get 28. To get the offset of rotation, you apply rotation's base alignment, giving you 32.
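If you want to check these numbers against what your implementation actually does, you can query the offsets through the uniform introspection API. A minimal sketch, assuming prog is your linked program containing TransformBlock:
// Query the byte offset std140 assigned to each member of the block.
const char* names[] = {
    "TransformBlock.scale",
    "TransformBlock.translation",
    "TransformBlock.rotation",
    "TransformBlock.projection_matrix"
};
GLuint indices[4];
GLint  offsets[4];
glGetUniformIndices(prog, 4, names, indices);
glGetActiveUniformsiv(prog, 4, indices, GL_UNIFORM_OFFSET, offsets);
// For the block above you should get back 0, 16, 32 and 80; querying
// GL_UNIFORM_ARRAY_STRIDE for rotation should report a stride of 16.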
Also, stop using vec3s.
Related
Summary: How does the compiler statically determine the size of a C++ class during compilation?
Details:
I'm trying to understand what the rules are for determining how much memory a class will use, and also how the memory will be aligned.
For example, the following code declares 4 classes. The first 2 are each 16 bytes, but the third is 48 bytes, even though it contains the same data members as the first 2. The fourth class has the same data members as the third, just in a different order, yet it is 32 bytes.
#include <xmmintrin.h>
#include <stdio.h>
class TestClass1 {
__m128i vect;
};
class TestClass2 {
char buf[8];
char buf2[8];
};
class TestClass3 {
char buf[8];
__m128i vect;
char buf2[8];
};
class TestClass4 {
char buf[8];
char buf2[8];
__m128i vect;
};
TestClass1 *ptr1;
TestClass2 *ptr2;
TestClass3 *ptr3;
TestClass4 *ptr4;
int main() {
ptr1 = new TestClass1();
ptr2 = new TestClass2();
ptr3 = new TestClass3();
ptr4 = new TestClass4();
printf("sizeof TestClass1 is: %lu\t TestClass2 is: %lu\t TestClass3 is: %lu\t TestClass4 is: %lu\n", sizeof(*ptr1), sizeof(*ptr2), sizeof(*ptr3), sizeof(*ptr4));
return 0;
}
I know that the answer has something to do with alignment of the data members of the class. But I am trying to understand exactly what these rules are and how they get applied during the compilation steps because I have a class that has a __m128i data member, but the data member is not 16-byte aligned and this results in a segfault when the compiler generates code using movaps to access the data.
For POD (plain old data), the rules are typically:
Each member in the structure has some size s and some alignment requirement a.
The compiler starts with a size S set to zero and an alignment requirement A set to one (byte).
The compiler processes each member in the structure in order:
Consider the member’s alignment requirement a. If S is not currently a multiple of a, then add just enough bytes to S so that it is a multiple of a. This determines where the member will go; it will go at offset S from the beginning of the structure (for the current value of S).
Set A to the least common multiple¹ of A and a.
Add s to S, to set aside space for the member.
When the above process is done for each member, consider the structure’s alignment requirement A. If S is not currently a multiple of A, then add just enough to S so that it is a multiple of A.
The size of the structure is the value of S when the above is done.
Additionally:
If any member is an array, its size is the number of elements multiplied by the size of each element, and its alignment requirement is the alignment requirement of an element.
If any member is a structure, its size and alignment requirement are calculated as above.
If any member is a union, its size is the size of its largest member plus just enough to make it a multiple of the least common multiple¹ of the alignments of all the members.
Consider your TestClass3:
S starts at 0 and A starts at 1.
char buf[8] requires 8 bytes and alignment 1, so S is increased by 8 to 8, and A remains 1.
__m128i vect requires 16 bytes and alignment 16. First, S must be increased to 16 to give the correct alignment. Then A must be increased to 16. Then S must be increased by 16 to make space for vect, so S is now 32.
char buf2[8] requires 8 bytes and alignment 1, so S is increased by 8 to 40, and A remains 16.
At the end, S is 40, which is not a multiple of A (16), so S must be increased by 8 to 48.
So the size of TestClass3 is 48 bytes.
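If it helps, the bookkeeping above can be written out as a tiny program. This is only a sketch of the rules as stated, with member sizes and alignments set to typical x86-64 values, not something any compiler actually exposes:
#include <cstddef>
#include <cstdio>
#include <vector>

// One (size, alignment) pair per member, in declaration order.
struct Member { std::size_t size; std::size_t align; };

std::size_t structSize(const std::vector<Member>& members) {
    std::size_t S = 0, A = 1;
    for (const Member& m : members) {
        if (S % m.align != 0)
            S += m.align - S % m.align;   // pad so the member starts on a multiple of its alignment
        if (m.align > A)
            A = m.align;                  // alignments are powers of two, so the LCM is just the max
        S += m.size;                      // reserve space for the member itself
    }
    if (S % A != 0)
        S += A - S % A;                   // tail padding to a multiple of the struct's alignment
    return S;
}

int main() {
    // char[8] -> (8, 1), __m128i -> (16, 16)
    std::printf("TestClass3: %zu\n", structSize({{8, 1}, {16, 16}, {8, 1}}));  // 48
    std::printf("TestClass4: %zu\n", structSize({{8, 1}, {8, 1}, {16, 16}}));  // 32
    return 0;
}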
For elementary types (int, double, et cetera), the alignment requirements are implementation-defined and are usually largely determined by the hardware. On many processors, it is faster to load and store data when it has a certain alignment (usually when its address in memory is a multiple of its size). Beyond this, the rules above follow largely from logic; they put each member where it must be to satisfy alignment requirements without using more space than necessary.
Footnote
¹ I have worded this for a general case as using the least common multiple of alignment requirements. However, since alignment requirements are always powers of two, the least common multiple of any set of alignment requirements is the largest of them.
It is entirely up to the compiler how the size of a class is determined. A compiler will usually compile to match a certain application binary interface, which is platform dependent.
The behaviour you've observed, however, is pretty typical. The compiler is trying to align the members so that they each begin at a multiple of their size. In the case of TestClass3, one of the members is of type __m128i and sizeof(__m128i) == 16. So it will try to align that member to begin at a byte that is a multiple of 16. The first member is of type char[8], so it takes up 8 bytes. If the compiler were to place the __m128i object directly after this first member, it would start at position 8, which is not a multiple of 16:
0 8 16 24 32 48
┌───────────────┬───────────────────────────────┬───────────────┬┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
│ char[8] │ __m128i │ char[8] │
└───────────────┴───────────────────────────────┴───────────────┴┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
So instead it prefers to do this:
0 8 16 24 32 48
┌───────────────┬┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┬───────────────────────────────┬───────────────┐┄┄┄
│ char[8] │ │ __m128i │ char[8] │
└───────────────┴┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┴───────────────────────────────┴───────────────┘┄┄┄
This gives it a size of 48 bytes.
When you reorder the members to get TestClass4 the layout becomes:
0 8 16 24 32 48
┌───────────────┬───────────────┬───────────────────────────────┬┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
│ char[8] │ char[8] │ __m128i │
└───────────────┴───────────────┴───────────────────────────────┴┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
Now everything is correctly aligned - the arrays are at offsets that are multiple of 1 (the size of their elements) and the __m128i object is at an offset that is a multiple of 16 - and the total size is 32 bytes.
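You can confirm those pictures on your own compiler with offsetof. A small sketch, using public-member struct equivalents of the question's classes (offsetof needs accessible members):
#include <xmmintrin.h>
#include <cstddef>
#include <cstdio>

// Same members as TestClass3 and TestClass4, made public so offsetof works.
struct S3 { char buf[8]; __m128i vect; char buf2[8]; };
struct S4 { char buf[8]; char buf2[8]; __m128i vect; };

int main() {
    std::printf("S3: vect at %zu, buf2 at %zu, sizeof %zu\n",
                offsetof(S3, vect), offsetof(S3, buf2), sizeof(S3));   // 16, 32, 48
    std::printf("S4: buf2 at %zu, vect at %zu, sizeof %zu\n",
                offsetof(S4, buf2), offsetof(S4, vect), sizeof(S4));   // 8, 16, 32
    return 0;
}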
The reason the compiler doesn't just do this rearrangement itself is because the standard specifies that later members of a class should have higher addresses:
Nonstatic data members of a (non-union) class with the same access control (Clause 11) are allocated so that later members have higher addresses within a class object.
The rules are set in stone by the Application Binary Interface specification in use, which ensures compatibility between different systems for programs sharing this interface.
For GCC, this is the Itanium ABI.
(Unfortunately it is no longer publicly available, though I did find a mirror.)
If you want to control the alignment yourself, you can use #pragma pack(1) in your header file.
Have a look at this post:
http://tedlogan.com/techblog2.html
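For what it's worth, here is a minimal sketch of what #pragma pack(1) does (the structs are illustrative, not taken from the question). Be aware that packing a struct that contains an over-aligned type such as __m128i can leave that member misaligned, which is exactly the movaps problem described above:
#include <cstdint>
#include <cstdio>

struct Unpacked {
    char     tag;     // offset 0
    uint32_t value;   // padded to offset 4
};                    // sizeof == 8

#pragma pack(push, 1)
struct Packed {
    char     tag;     // offset 0
    uint32_t value;   // offset 1, no padding inserted
};                    // sizeof == 5
#pragma pack(pop)

int main() {
    std::printf("Unpacked: %zu, Packed: %zu\n", sizeof(Unpacked), sizeof(Packed));
    return 0;
}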
Playing around with bindless rendering, I have one big static SSBO that holds my vertex data. The vertices are packed in memory as a contiguous array where each vertex has the following layout:
| Position (floats)                             | Normal (snorm shorts) | Pad   |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|      P.x      |      P.y      |      P.z      |  N.x  |  N.y  |  N.z  |       |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|     float     |     float     |     float     |     uint      |     uint      |
Note how each vertex is 20 bytes / 5 "words" / 1.25 vec4s. Not exactly a round number for a GPU. So instead of doing a bunch of padding and using unnecessary memory, I have opted to unpack the data "manually".
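For reference, the CPU-side layout of one packed vertex might look something like this (a sketch with made-up field names, just to make the 20-byte figure concrete):
#include <cstdint>

// Hypothetical CPU-side view of one packed vertex: 12 bytes of position
// followed by two 32-bit words holding the snorm16 normal (N.x/N.y in the
// first word, N.z plus 16 bits of padding in the second).
struct PackedVertex {
    float    px, py, pz;
    uint32_t normalXY;
    uint32_t normalZPad;
};
static_assert(sizeof(PackedVertex) == 20, "each vertex is 5 words");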
Vertex shader:
...
layout(std430, set = 0, binding = 1)
readonly buffer FloatStaticBuffer
{
float staticBufferFloats[];
};
layout(std430, set = 0, binding = 1) // Using the same binding?!
readonly buffer UintStaticBuffer
{
uint staticBufferUInts[];
};
...
void main()
{
const uint vertexBaseDataI = gl_VertexIndex * 5u;
// Unpack position
const vec3 position = vec3(
staticBufferFloats[vertexBaseDataI + 0u],
staticBufferFloats[vertexBaseDataI + 1u],
staticBufferFloats[vertexBaseDataI + 2u]);
// Unpack normal
const vec3 normal = vec3(
unpackSnorm2x16(staticBufferUInts[vertexBaseDataI + 3u]),
unpackSnorm2x16(staticBufferUInts[vertexBaseDataI + 4u]).x);
...
}
It is awfully convenient to be able to "alias" the buffer as both float and uint data.
The question: is "aliasing" a SSBO this way a terrible idea, and I'm just getting lucky, or is this actually a valid option that would work across platforms?
Alternatives:
Use just one buffer, say staticBufferUInts, and then use uintBitsToFloat to extract the positions. Not a big deal, but might have a small performance cost?
Bind the same buffer twice on the CPU to two different bindings. Again, not a big deal, just slightly annoying.
Vulkan allows incompatible resources to alias in memory as long as no malformed values are read from it. (Actually, I think it's allowed even when you read from the invalid sections - you should just get garbage. But I can't find the section of the standard right now that spells this out. The Vulkan standard is way too complicated.)
From the standard, section "Memory Aliasing":
Otherwise, the aliases interpret the contents of the memory
differently, and writes via one alias make the contents of memory
partially or completely undefined to the other alias. If the first alias is a host-accessible subresource, then the bytes affected are those written by the memory operations according to its addressing scheme. If the first alias is not host-accessible, then the bytes
affected are those overlapped by the image subresources that were
written. If the second alias is a host-accessible subresource, the
affected bytes become undefined. If the second alias is not
host-accessible, all sparse image blocks (for sparse
partially-resident images) or all image subresources (for non-sparse
image and fully resident sparse images) that overlap the affected
bytes become undefined.
Note that the standard talks about bytes being written and becoming undefined in aliasing resources. It's not the entire resource that becomes invalid.
Let's see it this way: You have two aliasing SSBOs (in reality just one that's bound twice) with different types (float, short int). Any bytes that you wrote floats into became valid in the "float view" and invalid in the "int view" the moment you wrote into the buffer. The same goes for the ints: The bytes occupied by them have become valid in the int view but invalid in the float view. According to the standard, this means that both views have invalid sections in them; however, neither of them is fully invalid. In particular, the sections you care about are still valid and may be read from.
In short: It's allowed.
I'm flattening out an octree and sending it to my fragment shader using an SSBO, and I believe I am running into some memory alignment issues. I'm using std430 for the layout and binding a vector of voxels to this SSBO. This is the structure in my shader (I'm using GLSL 4.30, FYI):
struct Voxel
{
bool data; // 4
vec4 pos; // 16
vec4 col; // 16
float size; // 4
int index; // 4
int pIndex; // 4
int cIdx[8]; // 4, 16 or 32 bytes?
};
layout (std430, binding=2) buffer octreeData
{
Voxel voxels[];
};
I'm not 100% sure but I think I'm running into an issue using the int cIdx[8] array inside of the struct, looking at the spec (page 124, section 7.6)
If the member is an array of scalars or vectors, the base alignment and array
stride are set to match the base alignment of a single array element, according
to rules (1), (2), and (3), and rounded up to the base alignment of a vec4. The
array may have padding at the end; the base offset of the member following
the array is rounded up to the next multiple of the base alignment.
I'm not entirely sure what the alignment is, I know the vec4's take up 16 bytes of memory, but how much does my array? If it was just sizeof(int)*8 that would be 32, but it says that it's set to the size of a single array element and then rounded up to a vec4 right? So does that mean my cIdx array has a base alignment of 16 bytes? There's no follow up members so is there padding getting added to my struct?
So total structure memory = 52 bytes (if we only allocate 4 bytes for cIdx), would that mean there is 12 bytes of padding being added on that I need to account for that may be causing me issues? If it was allocating 16 bytes would that be 64 bytes total for the structure and no memory alignment issues?
My corresponding c++ structure
struct Voxel
{
bool data;
glm::vec4 pos;
glm::vec4 col;
float size;
int index;
int pIndex;
int cIdx[8];
};
I'm then filling in my std::vector<Voxel> and passing it to my shader like so
glGenBuffers(1, &octreeSSBO);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, octreeSSBO);
glBufferData(GL_SHADER_STORAGE_BUFFER, voxelData.size()*sizeof(Voxel), voxelData.data(), GL_DYNAMIC_DRAW);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, octreeSSBO);
Reading directly from the voxelData vector, I can confirm that the data is getting filled in correctly, and I can even occasionally see that the data is getting passed to the shader, but it behaves incorrectly compared to what I would expect based on the values I'm looking at.
Does it look like there are memory alignment issues here?
I'm not entirely sure what the alignment is
The specification is very clear as to what the base alignment of things are. Your problem is not in item #4 (std430 doesn't do the rounding specified in #4 anyway).
Your problem is in #2:
If the member is a two- or four-component vector with components consuming N basic machine units, the base alignment is 2N or 4N, respectively.
In GLSL, vec4 has a base alignment of 16. That means that any vec4 must be allocated on a 16-byte boundary.
pos must be on a 16-byte boundary. However, data is only 4 bytes. Therefore, 12 bytes of padding must be inserted between data and pos to satisfy std430's alignment requirements.
However, glm::vec4 has a C++ alignment of 4. So the C++ compiler does not insert a bunch of padding between data and pos. Thus, the types in the two languages do not agree.
You should explicitly align all GLM vectors in C++ structs that you want to match GLSL, using C++11's alignas keyword:
struct Voxel
{
bool data;
alignas(16) glm::vec4 pos;
alignas(16) glm::vec4 col;
float size;
int index;
int pIndex;
int cIdx[8];
};
Also, I would not assume that the C++ type bool and the GLSL type bool have the same size.
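If you want the compiler to catch this kind of mismatch, you can assert the expected std430 offsets against the aligned struct. A sketch, assuming GLM's default (unaligned) vector types plus the explicit alignas shown above, and keeping the bool caveat in mind:
#include <cstddef>
#include <glm/glm.hpp>

struct Voxel
{
    bool data;
    alignas(16) glm::vec4 pos;
    alignas(16) glm::vec4 col;
    float size;
    int index;
    int pIndex;
    int cIdx[8];
};

// Expected std430 layout: pos at 16, col at 32, size at 48, index at 52,
// pIndex at 56, cIdx at 60, and an element stride of 96 for Voxel[].
static_assert(offsetof(Voxel, pos)  == 16, "pos should start at byte 16");
static_assert(offsetof(Voxel, col)  == 32, "col should start at byte 32");
static_assert(offsetof(Voxel, size) == 48, "size should start at byte 48");
static_assert(offsetof(Voxel, cIdx) == 60, "cIdx should start at byte 60");
static_assert(sizeof(Voxel) == 96, "Voxel stride should be 96 bytes");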
My GLSL fragment shader skips the "if" statement. The shader itself is very short.
I send some data via a uniform buffer object and use it further in the shader. However, the thing skips the assignment attached to the "if" statement for whatever reason.
I checked the values of the buffer object using glGetBufferSubData (tested with specific non zero values). Everything is where it needs to be. So I'm really kinda lost here. Must be some GLSL weirdness I'm not aware of.
Currently the shader looks like this:
#version 420
layout(std140, binding = 2) uniform textureVarBuffer
{
vec3 colorArray; // 16 bytes
int textureEnable; // 20 bytes
int normalMapEnable; // 24 bytes
int reflectionMapEnable; // 28 bytes
};
out vec4 frag_colour;
void main() {
frag_colour = vec4(1.0, 1.0, 1.0, 0.5);
if (textureEnable == 0) {
frag_colour = vec4(colorArray, 0.5);
}
}
You are confusing the base alignment rules with the offsets. The spec states:
The base offset of the first
member of a structure is taken from the aligned offset of the structure itself. The
base offset of all other structure members is derived by taking the offset of the
last basic machine unit consumed by the previous member and adding one. Each
structure member is stored in memory at its aligned offset. The members of a top-level
uniform block are laid out in buffer storage by treating the uniform block as
a structure with a base offset of zero.
It is true that a vec3 requires a base alignment of 16 bytes, but it only consumes 12 bytes. As a result, the next element after the vec3 will begin 12 bytes after the aligned offset of the vec3 itself. Since the alignment rules for int are just 4 bytes, there will be no padding at all.
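To make that concrete, a matching CPU-side struct for this block would look something like the following (a sketch; the names simply mirror the GLSL block):
#include <cstddef>
#include <cstdint>

// std140 layout of textureVarBuffer: the vec3 is aligned to 16 bytes but only
// consumes 12, so the ints follow immediately with no padding in between.
struct TextureVarBuffer {
    float   colorArray[3];        // offsets 0, 4, 8
    int32_t textureEnable;        // offset 12
    int32_t normalMapEnable;      // offset 16
    int32_t reflectionMapEnable;  // offset 20
};
static_assert(offsetof(TextureVarBuffer, textureEnable) == 12, "no padding after the vec3");
static_assert(sizeof(TextureVarBuffer) == 24, "block data occupies 24 bytes");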