Ring buffered SSBO with compute shader - opengl

I am performing view frustum culling and generating draw commands on the GPU in a compute shader, and I want to pass the bounding volumes in an SSBO. Currently I am using a large uniform array, but I want to handle more objects, hence the move to an SSBO.
What I want to accomplish is something akin to the AZDO approach of triple buffering: to avoid sync issues when updating the SSBO, only one third of the buffer is updated at a time while the rest is guarded with fences.
Is it possible to combine this with the compute shader dispatch, or should I just create three different SSBOs and bind each of them accordingly?
The solution as I currently see it would be to somehow tell the following draw call to only fetch data in the SSBO from a certain offset (0 * buffer_size, 1 * buffer_size, etc). Is this even possible?
Render loop
/* Fence creation omitted for clarity */
// Cycle round updating different parts of the buffer
const uint32_t buffer_idx = (frame % gl_state.bvb_num_partitions);
uint8_t* ptr = (uint8_t*)gl_state.bvbp + buffer_idx * gl_state.bvb_buffer_size;
std::memcpy(ptr, bounding_volumes.data(), gl_state.bvb_buffer_size);
const uint32_t gl_bv_binding_point = 3; // Shader hard coded
const uint32_t offset = buffer_idx * gl_state.bvb_buffer_size;
glBindBufferRange(GL_SHADER_STORAGE_BUFFER, gl_bv_binding_point, gl_state.bvb, offset, gl_state.bvb_buffer_size);
// OLD WAY: glUniform4fv(glGetUniformLocation(gl_state.cull_shader.gl_program, "spheres"), NUM_OBJECTS, &bounding_volumes[0].pos.x);
glUniform4fv(glGetUniformLocation(gl_state.cull_shader.gl_program, "frustum_planes"), 6, glm::value_ptr(frustum[0]));
glDispatchCompute(NUM_OBJECTS, 1, 1);
glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT); // Buffer objects affected by this bit are derived from the GL_DRAW_INDIRECT_BUFFER binding.
Bounding volume SSBO creation
// Bounding volume buffer
glGenBuffers(1, &gl_state.bvb);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, gl_state.bvb);
gl_state.bvb_buffer_size = NUM_OBJECTS * sizeof(BoundingVolume);
gl_state.bvb_num_partitions = 3; // 1 for application, 1 for OpenGL driver, 1 for GPU
GLbitfield flags = GL_MAP_COHERENT_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT;
glBufferStorage(GL_SHADER_STORAGE_BUFFER, gl_state.bvb_num_partitions * gl_state.bvb_buffer_size, nullptr, flags);
gl_state.bvbp = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, gl_state.bvb_buffer_size * gl_state.bvb_num_partitions, flags);
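One detail worth checking with this layout: for GL_SHADER_STORAGE_BUFFER, the offset passed to glBindBufferRange must be a multiple of GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT, so the stride between partitions may need padding beyond NUM_OBJECTS * sizeof(BoundingVolume). Below is a minimal sketch of just the offset arithmetic. The helper names align_up and partition_offset are hypothetical, and the alignment value of 256 used in the note afterwards is only an example; real code would query it with glGetIntegerv.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round `size` up to the next multiple of `alignment`.
// Hypothetical helper; `alignment` would come from
// glGetIntegerv(GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT, ...).
inline std::size_t align_up(std::size_t size, std::size_t alignment) {
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

// Byte offset of the partition written during `frame`, for a ring of
// `num_partitions` partitions spaced `partition_stride` bytes apart.
inline std::size_t partition_offset(std::uint64_t frame,
                                    std::size_t num_partitions,
                                    std::size_t partition_stride) {
    assert(num_partitions > 0);
    return static_cast<std::size_t>(frame % num_partitions) * partition_stride;
}
```

For example, with 1000 bytes of bounding volumes and an alignment of 256, the stride becomes 1024, and frames 0, 1, 2, 3 map to offsets 0, 1024, 2048, 0. The same stride would be used for both the memcpy destination and the glBindBufferRange offset, and the buffer would be allocated as num_partitions * stride.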

Related

OpenGL, glMapNamedBuffer takes a long time

I've been writing an OpenGL program that generates vertices on the GPU using compute shaders. The problem is that I need to read back, on the CPU, the number of vertices from a buffer written by one compute shader dispatch, so that I can allocate a buffer of the right size for the next dispatch to fill with vertices.
/*
* Stage 1- Populate the 3d texture with voxel values
*/
_EvaluateVoxels.Use();
glActiveTexture(GL_TEXTURE0);
GLPrintErrors("glActiveTexture(GL_TEXTURE0);");
glBindTexture(GL_TEXTURE_3D, _RandomSeedTexture);
glBindImageTexture(2, _VoxelValuesTexture, 0, GL_TRUE, NULL, GL_READ_WRITE, GL_R32F);
_EvaluateVoxels.SetVec3("CellSize", voxelCubeDims);
SetMetaBalls(metaballs);
_EvaluateVoxels.SetVec3("StartPos", chunkPosLL);
glDispatchCompute(voxelDim.x + 1, voxelDim.y + 1, voxelDim.z + 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
/*
* Stage 2 - Calculate the marching cube's case for each cube of 8 voxels,
* listing those that contain polygons and counting the no of vertices that will be produced
*/
_GetNonEmptyVoxels.Use();
_GetNonEmptyVoxels.SetFloat("IsoLevel", isoValue);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, _IntermediateDataSSBO);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, _AtomicCountersBuffer);
glDispatchCompute(voxelDim.x, voxelDim.y, voxelDim.z);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT);
//printStage2(_IntermediateDataSSBO, true);
_StopWatch.StopTimer("stage2");
_StopWatch.StartTimer("getvertexcounter");
// this line takes a long time
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);
unsigned int vertex_counter = vals[1];
unsigned int index_counter = vals[0];
vals[0] = 0;
vals[1] = 0;
glUnmapNamedBuffer(_AtomicCountersBuffer);
The image below shows the time in milliseconds that each stage of the code takes to run. "timer Evaluate" refers to the method as a whole, i.e. the sum of the previous stages; getvertexcounter refers only to the mapping, reading and unmapping of the buffer containing the number of vertices. Please see the code for more detail.
I've found this to be by far the slowest stage in the process, and I gather it has something to do with the asynchronous nature of the communication between OpenGL and the GPU and the need to synchronise data written by the compute shader so that the CPU can read it. My question is this: is this delay avoidable? I don't think the overall approach is flawed, because I know that someone else has implemented the algorithm in a similar way, albeit using DirectX (I think).
You can find my code at https://github.com/JimMarshall35/Marching-cubes-cpp/tree/main/MarchingCubes , the code in question is in the file ComputeShaderMarcher.cpp and the method unsigned int ComputeShaderMarcher::GenerateMesh(const glm::vec3& chunkPosLL, const glm::vec3& chunkDim, const glm::ivec3& voxelDim, float isoValue, GLuint VBO)
In order to access data from a buffer that you have had OpenGL write some data to, the CPU must halt execution until the GPU has actually written that data. Whatever process you use to access this data (glMapBufferRange, glGetBufferSubData, etc), that process must halt until the GPU has finished generating the data.
So don't try to access GPU-generated data until you're sure the GPU has actually generated it (or you have absolutely nothing better to do on the CPU than wait). Use fence sync objects to test whether the GPU has finished executing past a certain point.

Indices Problem with a Batch Renderer (OpenGL)

I'm trying to implement batch rendering for 3D objects in an engine I'm writing, and I can't manage to get the indices right.
In a 3D Renderer class I have a Renderer3DData structure that looks like this:
static const uint MaxQuads = 20000;
static const uint MaxVertices = MaxQuads * 4;
static const uint MaxIndices = MaxQuads * 6;
uint IndicesDrawCount = 0; // Debug var
std::vector<uint> Indices;
Ref<IndexBuffer> IBuffer = nullptr;
// Other data like a VBuffer, VArray...
So the vector of Indices will store the indices to draw on each batch while the IBuffer is the Index Buffer class which handles all OpenGL operations ("Ref" is a typedef to make a shared pointer).
Then a static Renderer3DData* s_3DData; is initialized in the init function and the index buffer is initialized as follows:
uint* indices = new uint[s_3DData->MaxIndices];
s_3DData->IBuffer = IndexBuffer::Create(indices, s_3DData->MaxIndices);
It is then bound together with the Vertex Array and the Vertex Buffer. The initialization is done properly, since everything works without batching.
So on each new batch the VArray gets bound and the Indices vector gets cleared, and for each mesh drawn it gets modified like this:
uint offset = 0;
std::vector<uint> indices = mesh->m_Indices;
for (uint i = 0; i < indices.size(); i += 6)
{
s_3DData->Indices.push_back(offset + 0 + indices[i]);
s_3DData->Indices.push_back(offset + 1 + indices[i]);
s_3DData->Indices.push_back(offset + 2 + indices[i]);
s_3DData->Indices.push_back(offset + 3 + indices[i]);
s_3DData->Indices.push_back(offset + 4 + indices[i]);
s_3DData->Indices.push_back(offset + 5 + indices[i]);
offset += 4;
s_3DData->IndicesDrawCount += 6;
}
I don't know how I came up with this way of filling the index buffer; I was just trying things out. Pushing only the indices, or the indices + offset, doesn't work either. Finally, on each draw, I do the following:
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, BufferID);
glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0, s_3DData->Indices.size(), s_3DData->Indices.data());
// With the vArray bound:
glDrawElements(GL_TRIANGLES, s_3DData->IndicesDrawCount, GL_UNSIGNED_INT, nullptr);
As I mentioned, when I'm not batching, the drawing (which doesn't go through this whole process) works, so the data in the mesh and the vertex/index buffers must be good. What I think is wrong is the way I fill the index buffer, since I'm not sure how to even set it up (unlike other rendering stuff).
The result is the following (it should be a solid sphere):
The way that "sphere" is rendered makes me think that the indices are wrong. The objects in the center are drawn without batching, to show that it's not the initial setup that's wrong. Does anybody see what I'm doing wrong?
I finally solved it (I'm crying, I've been stuck on this for a long time).
So there were a couple of problems:
First: The function s_3DData->IBuffer = IndexBuffer::Create(indices, s_3DData->MaxIndices); that I posted was doing the following:
glCreateBuffers(1, &m_BufferID);
glBindBuffer(GL_ARRAY_BUFFER, m_BufferID);
glBufferData(GL_ARRAY_BUFFER, count * sizeof(uint), nullptr, GL_STATIC_DRAW);
So the first problem was that I was creating index buffers with GL_STATIC_DRAW instead of GL_DYNAMIC_DRAW, which is the appropriate usage hint for batching since we are dynamically updating the buffer (my bad for not posting the function in full; I was pretty sleepy when I posted it).
Second: The call glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0, s_3DData->Indices.size(), s_3DData->Indices.data()); was wrong in its size parameter.
OpenGL expects the size passed to this function in bytes, so it is not the vector size but the vector size multiplied by sizeof(uint) (uint because the vector is a uint vector).
Third: The final problem was the loop that modified the indices vector on each mesh draw. It was wrong because it was written from the point of view of drawing quads in 2D (as I had previously been testing batching in 2D).
The correct loop is the following:
std::vector<uint> indices = mesh->m_Indices;
for (uint i = 0; i < indices.size(); ++i)
{
s_3DData->Indices.push_back(s_3DData->IndicesCurrentOffset + indices[i]);
++s_3DData->IndicesDrawCount;
++s_3DData->RendererStats.IndicesCount; // Debug Purpose
}
s_3DData->IndicesCurrentOffset += mesh->m_MaxIndex;
So now each mesh stores the (max index + 1) that it has (for a quad with indices from 0 to 3, this would be 4).
This way, I can go through all mesh indices while updating the indices that we use to draw and then I can update the current offset value so that we properly store all the indices drawn in order.
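As a standalone illustration of the second and third fixes, here is a minimal sketch without any OpenGL calls. IndexBatch, append_mesh and upload_size_bytes are hypothetical names, not the engine's actual types; the sketch assumes each mesh reports its vertex count, i.e. its (max index + 1).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the batch state described above.
struct IndexBatch {
    std::vector<std::uint32_t> indices;  // indices accumulated for this batch
    std::uint32_t vertexOffset = 0;      // vertices already placed in the batch
};

// Append one mesh's indices, rebased onto the vertices already in the batch.
// `vertexCount` is the mesh's (max index + 1).
inline void append_mesh(IndexBatch& batch,
                        const std::vector<std::uint32_t>& meshIndices,
                        std::uint32_t vertexCount) {
    for (std::uint32_t index : meshIndices) {
        assert(index < vertexCount);
        batch.indices.push_back(batch.vertexOffset + index);
    }
    batch.vertexOffset += vertexCount;
}

// Size in bytes to pass to glBufferSubData for the accumulated indices.
inline std::size_t upload_size_bytes(const IndexBatch& batch) {
    return batch.indices.size() * sizeof(std::uint32_t);
}
```

Appending the quad indices {0, 1, 2, 2, 3, 0} twice with a vertex count of 4 yields {0, 1, 2, 2, 3, 0, 4, 5, 6, 6, 7, 4}, and the upload size is 12 * 4 = 48 bytes rather than 12.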
Again, I'm not intending this to be fast or performant; I was just learning how to do this (and I did :) ).
The result:

How to count dead particles in the compute shader?

I am working on a particle system. To calculate each particle's position, time alive and so on, I use a compute shader. My problem is getting the count of dead particles back to the CPU so that I can tell the renderer how many particles to render. To store the particle data I use a shader storage buffer, and to render the particles I use instancing. I tried using an atomic counter buffer; it works fine, but copying the data from the buffer to the CPU is slow. I wonder if there is some other option.
This is the important part of the compute shader:
if (pData.timeAlive >= u_LifeTime)
{
pData.velocity = pData.defaultVelocity;
pData.timeAlive = 0;
pData.isAlive = u_Loop;
atomicCounterIncrement(deadParticles);
pVertex.position.x = pData.defaultPosition.x;
pVertex.position.y = pData.defaultPosition.y;
}
InVertex[id] = pVertex;
InData[id] = pData;
To copy the data to the CPU I use the following code:
uint32_t* OpenGLAtomicCounter::GetCounters()
{
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, m_AC);
glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(uint32_t) * m_NumberOfCounters, m_Counters);
return m_Counters;
}

How to flip data from frame buffer

I am attaching a PBO to the OpenGL framebuffer and then using glMapBuffer() to get access to the data.
I am passing the data to a Bluefish card for SDI Output.
The issue is that the resulting output appears inverted.
How can I invert the y axis of the data pointed to by the PBO pointer?
glReadBuffer(GL_FRONT);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIndex]);
// copy from framebuffer to PBO asynchronously. it will be ready in the NEXT frame
glReadPixels(0, 0, SCR_WIDTH, SCR_HEIGHT, GL_RGB, GL_UNSIGNED_BYTE, nullptr);
// now read other PBO which should be already in CPU memory
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIndex]);
// map buffer so we can access it
void* downsampleData = (unsigned char *)glMapBuffer(GL_PIXEL_PACK_BUFFER,GL_READ_ONLY);
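This is how I am flipping the data after advice from Nicol, and I get the desired result.
// Flip row by row; reversing every byte would also mirror the image and swap the channels.
// GL_RGB rows are padded to GL_PACK_ALIGNMENT (4 bytes by default).
const size_t rowBytes = (static_cast<size_t>(width) * 3 + 3) / 4 * 4;
const unsigned char* OriginalData = static_cast<const unsigned char*>(downsampleData);
std::vector<unsigned char> FlippedData(rowBytes * height); // needs <vector>
for (int y = 0; y < height; y++)
{
// Source row y becomes destination row (height - 1 - y).
memcpy(&FlippedData[(height - 1 - y) * rowBytes], OriginalData + y * rowBytes, rowBytes);
}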
You can't. OpenGL always considers the first row to be the bottom row of the image data for any image operation (sending/receiving pixel blocks, fetching texture samples/image data in a shader, etc). So if you want to invert the data you get, you will have to do that manually by copying the data around.
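If a second buffer is not wanted, the manual copy can also be done in place by swapping rows. The following is a minimal sketch of that idea, not the code used in the question. flip_rows_in_place and rgb_row_bytes are hypothetical helper names, and the stride formula assumes GL_RGB / GL_UNSIGNED_BYTE data read with the default GL_PACK_ALIGNMENT of 4. It would apply to a CPU-side copy of the pixels, since a buffer mapped with GL_READ_ONLY must not be written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Flip an image vertically in place by swapping row y with row (height - 1 - y).
// `rowBytes` is the full stride of one row, including any padding.
inline void flip_rows_in_place(unsigned char* pixels,
                               std::size_t rowBytes,
                               std::size_t height) {
    assert(pixels != nullptr || height == 0);
    for (std::size_t y = 0; y < height / 2; ++y) {
        unsigned char* top = pixels + y * rowBytes;
        unsigned char* bottom = pixels + (height - 1 - y) * rowBytes;
        std::swap_ranges(top, top + rowBytes, bottom);
    }
}

// Row stride of GL_RGB / GL_UNSIGNED_BYTE data, assuming the default
// GL_PACK_ALIGNMENT of 4 (each row is padded to a multiple of 4 bytes).
inline std::size_t rgb_row_bytes(std::size_t width) {
    return (width * 3 + 3) / 4 * 4;
}
```

Only half of the rows are visited because each swap fixes two rows at once; with an odd height the middle row stays where it is.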

How do I load multiple structs into a single UBO?

I am following the tutorials on: Here.
I have completed everything up to loading models, so my code is similar to the tutorial's at that point.
I am now trying to pass another struct to the Uniform Buffer Object, in a similar way to previously shown.
I have created another struct defined outside the application to store the data as follows:
struct Light{
alignas(16) glm::vec3 position;
alignas(16) glm::vec3 colour;
};
After doing this, I resized the uniform buffer size in the following way:
void createUniformBuffers() {
VkDeviceSize bufferSize = sizeof(CameraUBO) + sizeof(Light);
...
Next, when creating the descriptor sets, I added the lightBufferInfo below the already defined bufferInfo as shown below:
...
for (size_t i = 0; i < swapChainImages.size(); i++) {
VkDescriptorBufferInfo bufferInfo = {};
bufferInfo.buffer = uniformBuffers[i];
bufferInfo.offset = 0;
bufferInfo.range = sizeof(UniformBufferObject);
VkDescriptorBufferInfo lightBufferInfo = {};
lightBufferInfo.buffer = uniformBuffers[i];
lightBufferInfo.offset = 0;
lightBufferInfo.range = sizeof(Light);
...
I then added this to the descriptorWrites array:
...
descriptorWrites[2].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
descriptorWrites[2].dstSet = descriptorSets[i];
descriptorWrites[2].dstBinding = 2;
descriptorWrites[2].dstArrayElement = 0;
descriptorWrites[2].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
descriptorWrites[2].descriptorCount = 1;
descriptorWrites[2].pBufferInfo = &lightBufferInfo;
...
Now, similarly to the UniformBufferObject, I plan to use the updateUniformBuffer(uint32_t currentImage) function to change the light's position and colour, but first I just tried to set the position to a desired value:
void updateUniformBuffer(uint32_t currentImage) {
...
ubo.proj[1][1] *= -1;
Light light = {};
light.position = glm::vec3(0, 10, 10);
light.colour = glm::vec3(1, 1, 0);
void* data;
vkMapMemory(device, uniformBuffersMemory[currentImage], 0, sizeof(ubo), 0, &data);
memcpy(data, &ubo, sizeof(ubo));
vkUnmapMemory(device, uniformBuffersMemory[currentImage]);
}
I do not understand how the offset works when trying to pass two objects to a uniform buffer, so I do not know how to copy the light object to uniformBuffersMemory.
How would the offsets be defined in order to get this to work?
A note before reading further: splitting data for a single UBO into two different structs and descriptors makes passing data more complicated, as all your offsets need to be aligned to the minUniformBufferOffsetAlignment property of your device. If you're starting with Vulkan you may want to either split the data into two UBOs (creating two buffers) or just pass all values as a single struct.
But if you want to continue with the way you described in your post:
First you need to properly size your buffer. Because your copies need to be aligned to minUniformBufferOffsetAlignment, you probably can't just copy your light data to the area right after your other data. If your device has a minUniformBufferOffsetAlignment of 256 bytes and you want to copy over two host structs, your uniform buffer's size needs to be at least 2 * 256 bytes and not just sizeof(matrices) + sizeof(lights). So you need to adjust your bufferSize (the VkDeviceSize value) accordingly.
Next you need to offset your lightBufferInfo VkDescriptorBufferInfo:
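// The light data starts after the matrix data, so the offset depends on the size of the first struct.
lightBufferInfo.offset = std::max(sizeof(UniformBufferObject), minUniformBufferOffsetAlignment);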
This will let your vertex shader know where to start fetching data for that binding.
On most NVIDIA GPUs, for example, minUniformBufferOffsetAlignment is 256 bytes, whereas your Light struct is only 32 bytes. To make this work on such a GPU, the offset has to be a multiple of 256 bytes rather than 32.
Inspecting your setup in RenderDoc should then look similar to this:
Note that for more complex allocations and scenarios you'd need to properly get the right alignment size depending on the size of your data structure instead of using a simple max like above.
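A minimal sketch of what such an alignment calculation could look like, as plain arithmetic without any Vulkan calls. align_up_pow2, UniformLayout and make_layout are hypothetical names; the sketch relies on minUniformBufferOffsetAlignment being a power of two, which the Vulkan specification requires, and takes the sizes as plain integers.

```cpp
#include <cassert>
#include <cstdint>

// Round `value` up to the next multiple of `alignment`.
// Assumes `alignment` is a non-zero power of two.
inline std::uint64_t align_up_pow2(std::uint64_t value, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical layout for two structs packed into one uniform buffer.
struct UniformLayout {
    std::uint64_t lightOffset;  // offset for the second descriptor and copy
    std::uint64_t totalSize;    // minimum size to allocate for the buffer
};

inline UniformLayout make_layout(std::uint64_t cameraSize,
                                 std::uint64_t lightSize,
                                 std::uint64_t minAlignment) {
    UniformLayout layout;
    // The second struct starts at the first aligned offset past the first struct.
    layout.lightOffset = align_up_pow2(cameraSize, minAlignment);
    layout.totalSize = layout.lightOffset + lightSize;
    return layout;
}
```

With a 192-byte matrix struct, a 32-byte Light and an alignment of 256, this gives an offset of 256, the same as the simple max. With a 300-byte first struct, the max would give 300, which is not a multiple of 256, whereas rounding up gives 512. Rounding totalSize up to a multiple of the alignment as well, as suggested earlier, is also fine.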
And now when updating your uniform buffers you need to map and copy to the proper offset too:
void* mapped = nullptr;
// Copy matrix data to offset for binding 0
vkMapMemory(device, uniformBuffersMemory[currentImage].memory, 0, sizeof(ubo), 0, &mapped);
memcpy(mapped, &ubo, sizeof(ubo));
vkUnmapMemory(device, uniformBuffersMemory[currentImage].memory);
// Copy light data to offset for binding 1
vkMapMemory(device, uniformBuffersMemory[currentImage].memory, std::max(sizeof(ubo), minUniformBufferOffsetAlignment), sizeof(Light), 0, &mapped);
memcpy(mapped, &uboLight, sizeof(Light));
vkUnmapMemory(device, uniformBuffersMemory[currentImage].memory);
Note that you may want to only map once after creating the buffers for performance reasons rather than mapping on every update. Just store the offset pointer somewhere in your code.