OpenGL Shader Storage Buffer / memoryBarrierBuffer - c++

I created two SSBOs to handle some lights, because the VS→FS in/out interface can't pass a lot of lights (I'm using forward shading).
For the first one I only pass values to the shader (it's basically read-only) [cpp]:
struct GLightProperties
{
    unsigned int numLights;
    LightProperties properties[]; // flexible array member, sized at allocation time
};
...
glp = (GLightProperties*)malloc(sizeof(GLightProperties) + sizeof(LightProperties) * lastSize);
...
glGenBuffers(1, &ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(GLightProperties) + sizeof(LightProperties) * lastSize, glp, GL_DYNAMIC_COPY);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
Shader file [GLSL]:
layout(std430, binding = 1) buffer Lights
{
    uint numLights;
    LightProperties properties[];
} lights;
So this first SSBO turns out to work fine. However, the second one, whose purpose is to act as a VS→FS interface, has some issues:
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo2);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(float) * 4 * 3 * lastSize, nullptr, GL_DYNAMIC_COPY);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
GLSL:
struct TangentProperties
{
    vec4 TangentLightPos;
    vec4 TangentViewPos;
    vec4 TangentFragPos;
};

layout(std430, binding = 0) buffer TangentSpace
{
    TangentProperties tangentProperties[];
} tspace;
So here you'll notice I pass nullptr to glBufferData, because the VS will write into the buffer and the FS will then read its contents.
Like so in the VS Stage:
for(int i = 0; i < lights.numLights; i++)
{
    tspace.tangentProperties[index].TangentLightPos.xyz = TBN * lights.properties[index].lightPosition.xyz;
    tspace.tangentProperties[index].TangentViewPos.xyz = TBN * camPos;
    tspace.tangentProperties[index].TangentFragPos.xyz = TBN * vec3(worldPosition);
    memoryBarrierBuffer();
}
After this the FS reads the values, which turn out to be just garbage. Am I doing something wrong with memory barriers?
The output turns out this way: [screenshot of the corrupted lighting omitted]

OK, let's get the obvious bug out of the way:
for(int i = 0; i < lights.numLights; i++)
{
    tspace.tangentProperties[index].TangentLightPos.xyz = TBN * lights.properties[index].lightPosition.xyz;
    tspace.tangentProperties[index].TangentViewPos.xyz = TBN * camPos;
    tspace.tangentProperties[index].TangentFragPos.xyz = TBN * vec3(worldPosition);
    memoryBarrierBuffer();
}
index never changes in this loop, so you're only ever writing a single light's entry, over and over. All the other lights' entries will have garbage/undefined values.
So you probably meant i rather than index.
But that's only the beginning of the problem. See, if you make that change, you get this:
for(int i = 0; i < lights.numLights; i++)
{
    tspace.tangentProperties[i].TangentLightPos.xyz = TBN * lights.properties[i].lightPosition.xyz;
    tspace.tangentProperties[i].TangentViewPos.xyz = TBN * camPos;
    tspace.tangentProperties[i].TangentFragPos.xyz = TBN * vec3(worldPosition);
}

memoryBarrierBuffer();
Note that the barrier is outside the loop.
That creates a new problem. This code will have every vertex shader invocation writing to the same memory buffer. SSBOs, after all, are not VS output variables. Output variables are stored as part of a vertex. The rasterizer then interpolates this vertex data across the primitive as it rasterizes it, which provides the input values for the FS. So one VS cannot stomp on the output variables of another VS.
That doesn't happen with SSBOs. Every VS is acting on the same SSBO memory. So if they write to the same indices of the same array, they're writing to the same memory address. Which is a race condition (since there can be no synchronization between sibling invocations) and therefore undefined behavior.
So the only way that what you're trying to do could possibly work is if your buffer has numLights entries for each vertex in the entire scene.
That is a fundamentally unreasonable amount of space. Even if you could get it down to just the number of vertices in a particular draw call (which is doable, but I'm not going to say how), you would still be behind on performance. Every FS invocation would have to read 144 bytes of data per light (3 table entries, one for each vertex of the triangle, at three vec4s = 48 bytes each), linearly interpolate those values itself, and then do the lighting computations.
It would be faster for you to just pass the TBN matrix as a VS output and do the matrix multiplies in the FS. Yes, that's a lot of matrix multiplies, but GPUs are really fast at matrix multiplies, and are really slow at reading memory.
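To illustrate, here is a minimal sketch of that approach (the attribute locations, the model/viewProj uniforms, and the world→tangent convention are my assumptions, chosen to match how the question uses TBN):

Vertex shader [GLSL]:
#version 430 core
layout(location = 0) in vec3 aPos;
layout(location = 1) in vec3 aNormal;
layout(location = 2) in vec3 aTangent;

uniform mat4 model;
uniform mat4 viewProj;
uniform vec3 camPos;

out mat3 vTBN;            // world -> tangent, as in the question
out vec3 vTangentViewPos; // light-independent values can stay per-vertex
out vec3 vTangentFragPos;

void main()
{
    vec3 N = normalize(mat3(model) * aNormal);
    vec3 T = normalize(mat3(model) * aTangent);
    T = normalize(T - dot(T, N) * N); // re-orthogonalize
    vec3 B = cross(N, T);
    vTBN = transpose(mat3(T, B, N)); // transpose of an orthonormal basis == its inverse

    vec3 worldPos = vec3(model * vec4(aPos, 1.0));
    vTangentViewPos = vTBN * camPos;
    vTangentFragPos = vTBN * worldPos;
    gl_Position = viewProj * vec4(worldPos, 1.0);
}

Fragment shader [GLSL]:
#version 430 core
struct LightProperties { vec4 lightPosition; }; // trimmed for the sketch

layout(std430, binding = 1) buffer Lights
{
    uint numLights;
    LightProperties properties[];
} lights;

in mat3 vTBN;
in vec3 vTangentViewPos;
in vec3 vTangentFragPos;
out vec4 fragColor;

void main()
{
    vec3 color = vec3(0.0);
    for (uint i = 0u; i < lights.numLights; i++)
    {
        // the per-light multiply moves here; no SSBO writes, no race
        vec3 tangentLightPos = vTBN * lights.properties[i].lightPosition.xyz;
        vec3 L = normalize(tangentLightPos - vTangentFragPos);
        color += max(L.z, 0.0) * vec3(0.1); // placeholder diffuse term (normal = +Z in tangent space)
    }
    fragColor = vec4(color, 1.0);
}

Interpolating a mat3 costs nine scalar varyings, which is still far cheaper than the per-light SSBO traffic described above.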
Also, reconsider whether you need the tangent-space fragment position. Generally speaking, you never do.

Related

Unexpected value upon accessing an SSBO float

I am trying to calculate a morph offset for a GPU-driven animation.
To that effect I have the following function (and SSBOs):
layout(std140, binding = 7) buffer morph_buffer
{
    vec4 morph_targets[];
};

layout(std140, binding = 8) buffer morph_weight_buffer
{
    float morph_weights[];
};
vec3 GetMorphOffset()
{
    vec3 offset = vec3(0);
    for (int target_index = 0; target_index < target_count; target_index++)
    {
        float w1 = morph_weights[1];
        offset += w1 * morph_targets[target_index * vertex_count + gl_VertexIndex].xyz;
    }
    return offset;
}
I am seeing strange behaviour, so I opened RenderDoc to trace the state (the screenshots are omitted here). In RenderDoc's buffer view, index 1 of the morph_weights SSBO is 0. However, if I step over that read in RenderDoc's built-in shader debugger, the variable I get back is 1, not 0.
So I did a little experiment and changed one of the values in the buffer, and what the shader read back confirmed it: my SSBO of type float is being treated like an SSBO of vec4s. I am aware of the alignment issues with vec3s, but IIRC floats are fair game. What is happening?
After doing a little bit of asking around, I found the issue: the SSBO is marked as std140, and under std140 the stride of an array element is rounded up to 16 bytes, so each float in morph_weights occupies a full vec4 slot. The correct layout for a tightly packed float array is std430.
For the Vulkan GLSL dialect, an alternative is to use the scalar layout qualifier.
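A minimal sketch of the fix, keeping the declaration from the question otherwise unchanged:

// std430 packs scalar arrays tightly (4-byte stride for float),
// so morph_weights[1] reads the second float, as the C++ side expects
layout(std430, binding = 8) buffer morph_weight_buffer
{
    float morph_weights[];
};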

What is the relationship between the index in glBindBufferBase() and the stream number in a geometry shader in OpenGL?

I want to capture data from a geometry shader into transform feedback buffers, so I set the variable names like this:
const char *Varying[] = {"lColor", "lPos", "gl_NextBuffer", "rColor", "rPos"};
then bind two buffers, vbos[0] and vbos[1], to the transform feedback object Tfb: vbos[0] to binding point 0 and vbos[1] to binding point 2:
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, vbos[0]);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 2, vbos[1]);
and set stream number like this:
layout(stream = 0) out vec4 lColor;
layout(stream = 0) out vec4 lPos;
layout(stream = 2) out rVert
{
    vec4 rColor;
    vec4 rPos;
};
So I thought lColor and lPos would be captured into vbos[0], and rColor and rPos into vbos[1], but nothing came out unless I changed vbos[1] to binding point 1 and the stream number of rVert to 1.
I think there are three index-like parameters involved in binding geometry shader outputs to feedback buffers: first, the index in glBindBufferBase(target, index, id); second, the order of the names in glTransformFeedbackVaryings(program, count, names, bufferMode); and third, the stream number of the variables in the geometry shader. What is the relationship between these three?
I also found that no matter what parameters I passed to glDrawTransformFeedbackStream, the picture didn't change. Why?
The geometry shader layout qualifier stream has no direct relationship to the TF buffer binding index assignment. There is one caveat here: variables from two different streams can't be assigned to the same buffer binding index. But aside from that, the two have no relationship.
In geometry shader transform feedback, streams are written independently of each other. When you output a vertex/primitive, you say which stream the GS invocation is writing. Only the output variables assigned to that stream are written, and the system keeps track of how much stuff has been written to each stream.
But how the output variables map to feedback buffers is entirely separate (aside from the aforementioned caveat).
In your example, glTransformFeedbackVaryings with {"lColor","lPos","gl_NextBuffer","rColor","rPos"} does the following. It starts with buffer index 0. It assigns lColor and lPos to buffer index 0. gl_NextBuffer causes the system to increment the value of the current buffer index. That value is 0, so incrementing it makes it 1. It then assigns rColor and rPos to buffer 1.
That's why your code doesn't work.
You can skip buffer indices in the TF buffer bindings. To do that, you have to use gl_NextBuffer twice, since each use increments the current buffer index.
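For example, to land rColor and rPos on buffer index 2 (a sketch; note that gl_NextBuffer requires the GL_INTERLEAVED_ATTRIBS mode, and that glTransformFeedbackVaryings only takes effect at link time):

const char *varyings[] = {
    "lColor", "lPos", // captured into buffer index 0
    "gl_NextBuffer",  // current buffer index: 0 -> 1
    "gl_NextBuffer",  // current buffer index: 1 -> 2
    "rColor", "rPos"  // captured into buffer index 2
};
glTransformFeedbackVaryings(program, 6, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);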
Or, if your GL version is high enough (4.4, or with ARB_enhanced_layouts), you can just assign the buffer bindings and offsets directly in your shader:
layout(stream = 0, xfb_offset = 0, xfb_buffer = 0) out vec4 lColor;
layout(stream = 0, xfb_offset = 16, xfb_buffer = 0) out vec4 lPos;
layout(stream = 2, xfb_offset = 0, xfb_buffer = 2) out rVert
{
    vec4 rColor;
    vec4 rPos;
};

GLSL compute shader: setting a buffer with a lookup table results in no data written, while setting the same buffer with other data works

I am attempting to implement a slightly modified version of this standard marching cubes algorithm in a compute shader.
I have reached the stage at which triTable is used to insert the correct vertex indices into a buffer, and I have flattened the table to one dimension (const int triTable[4096] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,8,3...}).
The following code shows the error that I am experiencing (it does not implement the algorithm, but it demonstrates the current issue fully):
layout(binding = 1) buffer Grid
{
    float GridData[]; // 512*512*512 data volume generated previously; unused in this test case
};

uniform uint marchableCount;
uniform uint pointCount;

layout(std430, binding = 4) buffer X { uvec4 marchableList[]; }; // format is x, y, z, cubeIndex
layout(std430, binding = 5) buffer v { vec4 vertices[]; };
layout(std430, binding = 6) buffer n { vec4 normals[]; };

layout(binding = 7) uniform atomic_uint triCount;
void main()
{
    uvec3 gid = marchableList[gl_GlobalInvocationID.x].xyz; // xyz of the grid cell
    int E = int(edgeTable[marchableList[gl_GlobalInvocationID.x].w]);
    if (E != 0)
    {
        uint cubeIndex = marchableList[gl_GlobalInvocationID.x].w;
        uint index = atomicCounterIncrement(triCount);
        int tCount = 0; // unused in this test; used for iteration in the actual algorithm
        int tGet = tCount + 16 * int(cubeIndex); // correction from converting the 2D table to 1D
        vertices[index] = vec4(tGet);
    }
}
This code produces the expected values: the vertices buffer is filled with data and the atomic counter increments.
Changing this line:
vertices[index] = vec4(tGet);
to
vertices[index] = vec4(triTable[tGet]);
or
vertices[index] = vec4(triTable[tGet]+1);
(demonstrating that triTable is not coincidentally returning zeros)
results in what appears to be a complete failure of the shader: the buffer is filled with zeros and the atomic counter does not increment. No error messages are output when the shader is compiled, and tGet is always less than 4096.
The following test cases also produce the correct output:
vertices[index] = vec4(triTable[3]); //-1
vertices[index] = vec4(triTable[4095]); //also -1
showing that triTable is in fact implemented correctly.
What causes the shader to have issues in these very specific cases?
I'm more surprised that const int triTable[4096] = {...}; compiles at all. That array, if it is actually needed, is 16KB in size. That's a lot for a shader, even if the array lives in shared memory.
What is most likely happening is that whenever the compiler detects a usage of this array that it can't optimize down to a simple value (triTable[3] is a compile-time constant, so the compiler doesn't need to store the whole table to resolve it), the compilation either fails or produces a non-functional shader.
It would be best to make this table a uniform buffer. An SSBO might work too, but some hardware implements uniform blocks through specialized memory rather than with a global memory fetch.
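A sketch of the uniform-block route (one caveat: under std140 an int array gets a 16-byte element stride, which would quadruple the buffer to 64KB, so the usual trick is to pack four table entries into each ivec4; the binding point 2 here is an arbitrary choice):

GLSL:
layout(std140, binding = 2) uniform TriTable
{
    ivec4 triTablePacked[1024]; // 4096 ints, four per element: exactly the 16KB minimum GL_MAX_UNIFORM_BLOCK_SIZE
};

int triTableAt(int i)
{
    return triTablePacked[i >> 2][i & 3]; // unpack: element i/4, component i%4
}

C++:
GLuint tableUbo;
glGenBuffers(1, &tableUbo);
glBindBuffer(GL_UNIFORM_BUFFER, tableUbo);
glBufferData(GL_UNIFORM_BUFFER, 4096 * sizeof(int), triTable, GL_STATIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, 2, tableUbo);

The usage site then becomes vertices[index] = vec4(triTableAt(tGet));.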

OpenGL 4.5 - Shader storage buffer objects layout

I'm trying my hand at shader storage buffer objects (aka Buffer Blocks) and there are a couple of things I don't fully grasp. What I'm trying to do is to store the (simplified) data of an indeterminate number of lights n in them, so my shader can iterate through them and perform calculations.
Let me start by saying that I get the correct results, and no errors from OpenGL. However, it bothers me not to know why it is working.
So, in my shader, I got the following:
struct PointLight {
    vec3 pos;
    float intensity;
};

layout (std430, binding = 0) buffer PointLights {
    PointLight pointLights[];
};

void main() {
    PointLight light;
    for (int i = 0; i < pointLights.length(); i++) {
        light = pointLights[i];
        // etc
    }
}
and in my application:
struct PointLightData {
    glm::vec3 pos;
    float intensity;
};

class PointLight {
    // ...
    PointLightData data;
    // ...
};

std::vector<PointLight*> pointLights;
glGenBuffers(1, &BBO);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, BBO);
glNamedBufferStorage(BBO, n * sizeof(PointLightData), NULL, GL_DYNAMIC_STORAGE_BIT);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, BBO);
...
for (unsigned int i = 0; i < pointLights.size(); i++) {
    glNamedBufferSubData(BBO, i * sizeof(PointLightData), sizeof(PointLightData), &(pointLights[i]->data));
}
In this last loop I'm storing a PointLightData struct with an offset equal to its size times the number of them I've already stored (so offset 0 for the first one).
So, as I said, everything seems correct. Binding points are correctly set to zero, I have enough memory allocated for my objects, etc. The graphical results are OK.
Now to my questions. I am using std430 as the layout; in fact, if I change it to std140 as I originally did, it breaks. Why is that? My hypothesis is that the layout generated by std430 for the shader's PointLights buffer block happens to match the layout the compiler generates for my application's PointLightData struct (as you can see in that loop, I'm blindly storing one struct after the other). Do you think that's the case?
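(If that hypothesis is right, it can at least be pinned down on the C++ side. A sketch, using the PointLightData definition above; the 16/12 figures are the std430 numbers for a struct of vec3 plus float:)

#include <cstddef> // offsetof

// std430 aligns the vec3 member to 16 bytes but lets the float pack into its
// tail padding, giving the struct a 16-byte array stride - which glm::vec3
// (12 bytes) immediately followed by a float (4 bytes) happens to match.
static_assert(sizeof(PointLightData) == 16, "size does not match the std430 stride");
static_assert(offsetof(PointLightData, intensity) == 12, "intensity offset does not match std430");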
Now, assuming I'm correct in that assumption, the obvious solution would be to do the mapping of sizes and offsets myself, querying OpenGL with glGetUniformIndices and glGetActiveUniformsiv (the latter called with GL_UNIFORM_SIZE and GL_UNIFORM_OFFSET). But I have the sneaking suspicion that those two only work with uniform blocks, not buffer blocks like I'm using. At least, when I do the following, OpenGL throws a tantrum, gives me back error 1281 (GL_INVALID_VALUE) and returns a very weird number as the indices (something like 3432898282):
const char *names[2] = {
    "pos", "intensity"
};
GLuint indices[2];
GLint size[2];
GLint offset[2];
glGetUniformIndices(shaderProgram->id, 2, names, indices);
glGetActiveUniformsiv(shaderProgram->id, 2, indices, GL_UNIFORM_SIZE, size);
glGetActiveUniformsiv(shaderProgram->id, 2, indices, GL_UNIFORM_OFFSET, offset);
Am I correct in saying that glGetUniformIndices and glGetActiveUniformsiv do not apply to buffer blocks?
If they do not, or if the fact that it's working is, as I imagine, just a coincidence, how could I do the mapping manually? I checked appendix H of the programming guide, and the wording for arrays of structures is somewhat confusing. If I can't query OpenGL for the sizes/offsets of what I'm trying to do, I guess I could compute them manually (cumbersome as that is), but I'd appreciate some help there, too.
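(For reference, a sketch of the GL 4.3 program-interface-query path, which does cover buffer blocks: SSBO members are GL_BUFFER_VARIABLE resources, and the reflection names below follow GL's naming rules for arrays of structs, so they may need adjusting:)

// Buffer-block members are queried through glGetProgramResourceIndex /
// glGetProgramResourceiv rather than the uniform introspection calls.
const char *memberNames[2] = { "pointLights[0].pos", "pointLights[0].intensity" };
for (int i = 0; i < 2; ++i) {
    GLuint idx = glGetProgramResourceIndex(shaderProgram->id, GL_BUFFER_VARIABLE, memberNames[i]);
    const GLenum props[2] = { GL_OFFSET, GL_TOP_LEVEL_ARRAY_STRIDE };
    GLint values[2];
    glGetProgramResourceiv(shaderProgram->id, GL_BUFFER_VARIABLE, idx,
                           2, props, 2, nullptr, values);
    // values[0]: byte offset of the member within one array element
    // values[1]: byte stride between successive PointLight elements
}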

Can you modify a uniform from within the shader? If so, how?

So I wanted to store all my meshes in one large VBO. The problem is: how do you issue just one draw call, but let every mesh have its own model-to-world matrix?
My idea was to submit an array of matrices to a uniform before drawing. In the VBO I would make the color of every first vertex of a mesh negative (so I'd be using the sign bit to check whether a vertex is the first of a mesh).
Okay, so I can detect when a new mesh has started, and I have an array of matrices ready and, presumably, a uniform called 'index'. But how do I increase this index by one every time I encounter a new mesh?
Can you modify a uniform from within the shader? If so, how?
Can you modify a uniform from within the shader?
If you could, it wouldn't be uniform anymore, would it?
Furthermore, what you're wanting to do cannot be done even with Image Load/Store or SSBOs, both of which allow shaders to write data. It won't work because vertex shader invocations are not required to be executed sequentially. Many happen at the same time, and there's no way for any shader invocation to know that it will happen "after" the "first vertex" in a mesh.
The simplest way to deal with this is the obvious solution. Render each mesh individually, but set the uniforms for each mesh before each draw call. Without changing buffers between draws, of course. Uniform changes, while not exactly cheap, aren't the most expensive state changes that exist.
There are more complicated drawing methods that could allow you more performance. But that form is adequate for most needs. You've already done the hard part: you removed the need for any state change (textures, buffers, vertex formats, etc) except uniform state.
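A sketch of that draw loop (the Mesh record, the shared VAO, and the glm usage are my assumptions; glDrawElementsBaseVertex lets all meshes share one vertex/index buffer pair):

GLint modelLoc = glGetUniformLocation(program, "modelMatrix");
glUseProgram(program);
glBindVertexArray(sharedVao); // buffers and vertex format never change
for (const Mesh &mesh : meshes)
{
    // the only state change between draws: this mesh's model-to-world matrix
    glUniformMatrix4fv(modelLoc, 1, GL_FALSE, glm::value_ptr(mesh.modelMatrix));
    glDrawElementsBaseVertex(GL_TRIANGLES, mesh.indexCount, GL_UNSIGNED_INT,
                             (void *)(mesh.firstIndex * sizeof(GLuint)),
                             mesh.baseVertex);
}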
There are two approaches to minimizing draw calls: instancing and batching. The first, instancing, lets you draw multiple copies of the same mesh in one draw call; it is API-dependent (available since OpenGL 3.1). Batching is similar to instancing but lets you draw different meshes. Both approaches have restrictions: the meshes must share the same material and shaders.
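In a nutshell, the instancing path looks like this (a sketch: one mesh drawn instanceCount times, with per-instance matrices fetched by gl_InstanceID; the uniform array and its size are illustrative):

GLSL (vertex shader):
uniform mat4 models[128]; // one model-to-world matrix per instance
// ...
gl_Position = viewProj * models[gl_InstanceID] * vec4(aPos, 1.0);

C++:
glUniformMatrix4fv(glGetUniformLocation(program, "models"),
                   instanceCount, GL_FALSE, glm::value_ptr(matrices[0]));
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr, instanceCount);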
If you want to draw different meshes from one VBO, then instancing is not an option. Batching, then, requires keeping all meshes in one 'big' VBO with their world transforms already applied to the vertices. That's not a problem for static meshes, but it's somewhat awkward for animated ones. Here is some pseudocode for a batching implementation:
struct SGeometry
{
    uint64_t  offsetVB;
    uint64_t  offsetIB;
    uint64_t  sizeVB;
    uint64_t  sizeIB;
    glm::mat4 oldTransform;
    glm::mat4 transform;
};

std::vector<SGeometry> cachedGeometries;
...
void CommitInstances()
{
    uint64_t vertexOffset = 0;
    uint64_t indexOffset = 0;
    for (auto instance : allInstances)
    {
        Copy(instance->Vertexes(), VBO); // append this instance's vertices to the big VBO
        for (uint64_t i = 0; i < instance->Indices().size(); ++i)
        {
            // rebase each index against this instance's first vertex in the big VBO
            auto index = instance->Indices()[i];
            index += vertexOffset;
            IBO[indexOffset + i] = index;
        }
        cachedGeometries.push_back({vertexOffset, indexOffset,
                                    instance->Vertexes().size(),
                                    instance->Indices().size(),
                                    glm::mat4(1.0f), glm::mat4(1.0f)});
        vertexOffset += instance->Vertexes().size();
        indexOffset += instance->Indices().size();
    }
    Commit(VBO);
    Commit(IBO);
}

void ApplyTransform(glm::mat4 newTransform, uint64_t instanceId)
{
    SGeometry& geom = cachedGeometries[instanceId];
    glm::mat4 inverseOldTransform = glm::inverse(geom.oldTransform);
    VertexStream& stream = VBO->GetStream(Position, geom.offsetVB);
    for (uint64_t i = 0; i < geom.sizeVB; ++i)
    {
        glm::vec3 pos = stream.Get(i);
        // We need to revert the old absolute transform before applying the new one
        pos = glm::vec3(inverseOldTransform * glm::vec4(pos, 1.0f));
        pos = glm::vec3(newTransform * glm::vec4(pos, 1.0f));
        stream.Set(i, pos);
    }
    geom.oldTransform = newTransform;
    // ... apply the corresponding normal transform to the normals stream
}
GPU Gems 2 has a good article about geometry instancing: http://www.amazon.com/GPU-Gems-Programming-High-Performance-General-Purpose/dp/0321335597