OpenGL, glMapNamedBuffer takes a long time - c++

I've been writing an OpenGL program that generates vertices on the GPU using compute shaders. The problem is that I need to read back, on the CPU, the number of vertices written to a buffer by one compute shader dispatch, so that I can allocate a buffer of the right size for the next dispatch to fill with vertices.
/*
* Stage 1 - Populate the 3D texture with voxel values
*/
_EvaluateVoxels.Use();
glActiveTexture(GL_TEXTURE0);
GLPrintErrors("glActiveTexture(GL_TEXTURE0);");
glBindTexture(GL_TEXTURE_3D, _RandomSeedTexture);
glBindImageTexture(2, _VoxelValuesTexture, 0, GL_TRUE, 0, GL_READ_WRITE, GL_R32F);
_EvaluateVoxels.SetVec3("CellSize", voxelCubeDims);
SetMetaBalls(metaballs);
_EvaluateVoxels.SetVec3("StartPos", chunkPosLL);
glDispatchCompute(voxelDim.x + 1, voxelDim.y + 1, voxelDim.z + 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
/*
* Stage 2 - Calculate the marching cubes case for each cube of 8 voxels,
* listing those that contain polygons and counting the number of vertices that will be produced
*/
_GetNonEmptyVoxels.Use();
_GetNonEmptyVoxels.SetFloat("IsoLevel", isoValue);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, _IntermediateDataSSBO);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, _AtomicCountersBuffer);
glDispatchCompute(voxelDim.x, voxelDim.y, voxelDim.z);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT);
//printStage2(_IntermediateDataSSBO, true);
_StopWatch.StopTimer("stage2");
_StopWatch.StartTimer("getvertexcounter");
// this line takes a long time
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);
unsigned int vertex_counter = vals[1];
unsigned int index_counter = vals[0];
vals[0] = 0;
vals[1] = 0;
glUnmapNamedBuffer(_AtomicCountersBuffer);
The image below shows the time in milliseconds that each stage of the code takes to run. "timer Evaluate" refers to the method as a whole, i.e. the sum total of the previous stages; "getvertexcounter" refers only to the mapping, reading, and unmapping of the buffer containing the number of vertices. Please see the code for more detail.
I've found this to be by far the slowest stage in the process, and I gather it has something to do with the asynchronous nature of the communication between OpenGL and the GPU, and the need to synchronize data written by the compute shader so that it can be read by the CPU. My question is this: is this delay avoidable? I don't think the overall approach is flawed, because I know someone else has implemented the algorithm in a similar way, albeit using DirectX (I think).
You can find my code at https://github.com/JimMarshall35/Marching-cubes-cpp/tree/main/MarchingCubes . The code in question is in the file ComputeShaderMarcher.cpp, in the method unsigned int ComputeShaderMarcher::GenerateMesh(const glm::vec3& chunkPosLL, const glm::vec3& chunkDim, const glm::ivec3& voxelDim, float isoValue, GLuint VBO).

In order to access data from a buffer that you have had OpenGL write to, the CPU must halt execution until the GPU has actually written that data. Whatever mechanism you use to access it (glMapBufferRange, glGetBufferSubData, etc.), that call must stall until the GPU has finished generating the data.
So don't try to access GPU-generated data until you're sure the GPU has actually generated it (or you have absolutely nothing better to do on the CPU than wait). Use fence sync objects to test whether the GPU has finished executing past a certain point.
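For example, you could insert a fence right after the dispatch that writes the counters and poll it while doing other CPU work (a minimal sketch using the buffer name from the question; the polling loop and timeout values are placeholders for whatever work you have available):
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
// Do other useful CPU work here instead of blocking immediately.
// GL_SYNC_FLUSH_COMMANDS_BIT on the first wait ensures the fence is flushed.
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
while (status == GL_TIMEOUT_EXPIRED)
{
    // ...more CPU work per iteration, or just spin with a timeout...
    status = glClientWaitSync(fence, 0, 1000000); // wait up to 1 ms per call
}
glDeleteSync(fence);
// The map no longer has to stall: the GPU is known to be finished.
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);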

Related

How to count dead particles in the compute shader?

I am working on a particle system. For the calculation of each particle's position, time alive, and so on, I use a compute shader. My problem is getting the count of dead particles back to the CPU, so I can set how many particles the renderer should render. To store the particle data I use a shader storage buffer, and to render the particles I use instancing. I tried using an atomic counter buffer; it works fine, but copying the data from the buffer to the CPU is slow. I wonder if there is some other option.
This is the important part of the compute shader:
if (pData.timeAlive >= u_LifeTime)
{
    pData.velocity = pData.defaultVelocity;
    pData.timeAlive = 0;
    pData.isAlive = u_Loop;
    atomicCounterIncrement(deadParticles);
    pVertex.position.x = pData.defaultPosition.x;
    pVertex.position.y = pData.defaultPosition.y;
}
InVertex[id] = pVertex;
InData[id] = pData;
To copy the data to the CPU I use the following code:
uint32_t* OpenGLAtomicCounter::GetCounters()
{
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, m_AC);
    glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(uint32_t) * m_NumberOfCounters, m_Counters);
    return m_Counters;
}
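One way to hide that cost (a sketch, not a drop-in fix: m_ReadbackBuffer, m_Fences, and m_MappedPtr are assumed members, and the caller accepts counter values that are one frame old) is to copy the counters into a persistently mapped readback buffer on the GPU timeline and consume them the next frame:
// Assumed one-time setup: a double-buffered readback buffer created with
// glBufferStorage(..., GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT)
// and mapped once with the same flags into m_MappedPtr (requires GL 4.4+).
void OpenGLAtomicCounter::QueueReadback(uint32_t frame)
{
    const GLintptr dstOffset = (frame % 2) * sizeof(uint32_t) * m_NumberOfCounters;
    // GPU-side copy: queued on the GL timeline, does not stall the CPU.
    glCopyNamedBufferSubData(m_AC, m_ReadbackBuffer, 0, dstOffset,
                             sizeof(uint32_t) * m_NumberOfCounters);
    m_Fences[frame % 2] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

uint32_t* OpenGLAtomicCounter::GetLastFrameCounters(uint32_t frame)
{
    const uint32_t slot = (frame + 1) % 2; // the copy queued last frame
    // Normally already signaled by now, so this wait is close to free.
    glClientWaitSync(m_Fences[slot], GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
    glDeleteSync(m_Fences[slot]);
    return (uint32_t*)m_MappedPtr + slot * m_NumberOfCounters;
}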

OpenGL How does updating buffers affect speed

I have a buffer I map in order to send vertex attributes. Here is the basic functionality of the code:
glBindBuffer(GL_ARRAY_BUFFER, _bufferID);
_buffer = (VertexData*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
for (Renderable* renderable : renderables) {
    const glm::vec3& size = renderable->getSize();
    const glm::vec3& position = renderable->getPosition();
    const glm::vec4& color = renderable->getColor();
    const glm::mat4& modelMatrix = renderable->getModelMatrix();
    glm::vec3 vertexNormal = glm::vec3(0, 1, 0);
    _buffer->position = glm::vec3(modelMatrix * glm::vec4(position.x, position.y, position.z, 1));
    _buffer->color = color;
    _buffer->texCoords = glm::vec2(0, 0);
    _buffer->normal = vertexNormal;
    _buffer++;
}
and then I draw all renderables in one draw call. I am curious as to why touching the _buffer variable at all causes a massive slowdown in the program. For example, if I call std::cout << _buffer->position.x; every frame, my FPS tanks to about a quarter of what it usually is.
What I want to know is why this happens. The reason I ask is that I want to be able to translate objects in the batch when they are moved. Essentially, I want the buffer to always stay in the same spot, so that I can change it without a huge sacrifice in performance. I assume this isn't possible, but I would like to know why. Here is an example of what I would want to do if this didn't cause massive issues:
if (renderables.at(index)->hasChangedPosition()) {
    _buffer += index;
    _buffer->position = renderables.at(index)->getPosition();
}
I am aware that I can send the transforms through a shader uniform, but you can't do that for batched objects in one draw call.
why touching the _buffer variable at all causes a massive slowdown in the program
...well, you did request a GL_WRITE_ONLY buffer; it's entirely possible that the GL driver set up the memory pages backing the pointer returned by glMapBuffer() with a custom fault handler that actually goes out to the GPU to fetch the requested bytes, which can be...not fast.
Whereas if you only write to the provided addresses the driver/OS doesn't have to do anything until the glUnmapBuffer() call, at which point it can set up a nice, fast DMA transfer to blast the new buffer contents out to GPU memory in one go.
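In practice that means keeping a CPU-side shadow copy for anything you need to read, and only ever writing through the mapped pointer (a sketch; _cpuCopy is an assumed member mirroring the buffer contents, not part of the question's code):
// _cpuCopy mirrors the VBO contents in ordinary system memory.
std::vector<VertexData> _cpuCopy; // resized once to match the VBO

// Reads come from the shadow copy: plain cached memory, fast.
std::cout << _cpuCopy[index].position.x << std::endl;

// Updating one renderable: change the shadow copy, then write (never
// read) the same slot through the GL_WRITE_ONLY mapping.
_cpuCopy[index].position = renderables.at(index)->getPosition();

glBindBuffer(GL_ARRAY_BUFFER, _bufferID);
VertexData* gpuPtr = (VertexData*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
gpuPtr[index] = _cpuCopy[index]; // write-only traffic, no page-fault readback
glUnmapBuffer(GL_ARRAY_BUFFER);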

Ring buffered SSBO with compute shader

I am performing view frustum culling and generating draw commands on the GPU in a compute shader, and I want to pass the bounding volumes in an SSBO. Currently I am using just a large uniform array, but I want to go bigger, hence the need to move to an SSBO.
What I want to accomplish is something akin to the AZDO approach of triple buffering: avoid sync issues when updating the SSBO by only updating one third of the buffer while guarding the rest with fences.
Is it possible to combine this with the compute shader dispatch, or should I just create three different SSBOs and bind each of them accordingly?
The solution as I currently see it would be to somehow tell the following draw call to fetch data in the SSBO only from a certain offset (0 * buffer_size, 1 * buffer_size, etc.). Is this even possible?
Render loop
/* Fence creation omitted for clarity */
// Cycle round updating different parts of the buffer
const uint32_t buffer_idx = (frame % gl_state.bvb_num_partitions);
uint8_t* ptr = (uint8_t*)gl_state.bvbp + buffer_idx * gl_state.bvb_buffer_size;
std::memcpy(ptr, bounding_volumes.data(), gl_state.bvb_buffer_size);
const uint32_t gl_bv_binding_point = 3; // Shader hard coded
const uint32_t offset = buffer_idx * gl_state.bvb_buffer_size;
glBindBufferRange(GL_SHADER_STORAGE_BUFFER, gl_bv_binding_point, gl_state.bvb, offset, gl_state.bvb_buffer_size);
// OLD WAY: glUniform4fv(glGetUniformLocation(gl_state.cull_shader.gl_program, "spheres"), NUM_OBJECTS, &bounding_volumes[0].pos.x);
glUniform4fv(glGetUniformLocation(gl_state.cull_shader.gl_program, "frustum_planes"), 6, glm::value_ptr(frustum[0]));
glDispatchCompute(NUM_OBJECTS, 1, 1);
glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT); // Buffer objects affected by this bit are derived from the GL_DRAW_INDIRECT_BUFFER binding.
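The fence logic omitted above could look roughly like this (a sketch; gl_state.bvb_fences is an assumed array of gl_state.bvb_num_partitions sync objects, initially null):
// Before overwriting partition buffer_idx, make sure the GPU has finished
// the frame that last read from it.
if (gl_state.bvb_fences[buffer_idx] != nullptr)
{
    glClientWaitSync(gl_state.bvb_fences[buffer_idx],
                     GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX); // <cstdint>
    glDeleteSync(gl_state.bvb_fences[buffer_idx]);
}

// ... memcpy into the partition, glBindBufferRange, dispatch, draw ...

// After submitting the work that reads this partition, fence it so a
// later frame knows when the partition is safe to reuse.
gl_state.bvb_fences[buffer_idx] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);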
Bounding volume SSBO creation
// Bounding volume buffer
glGenBuffers(1, &gl_state.bvb);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, gl_state.bvb);
gl_state.bvb_buffer_size = NUM_OBJECTS * sizeof(BoundingVolume);
gl_state.bvb_num_partitions = 3; // 1 for application, 1 for OpenGL driver, 1 for GPU
GLbitfield flags = GL_MAP_COHERENT_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT;
glBufferStorage(GL_SHADER_STORAGE_BUFFER, gl_state.bvb_num_partitions * gl_state.bvb_buffer_size, nullptr, flags);
gl_state.bvbp = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, gl_state.bvb_buffer_size * gl_state.bvb_num_partitions, flags);

OpenGL Merging Vertex Data to Minimize Draw Calls

Background
2D "Infinite" World separated into chunks
One VAO (& VBO/EBO) per chunk
Nested for loop in chunk render; one draw call per block.
Code
void Chunk::Render(/* ... */) {
    glBindVertexArray(vao);
    for (int x = 0; x < 64; x++) {
        for (int y = 0; y < 64; y++) {
            if (blocks[x][y] == 1) {
                /* ... Uniforms ... */
                glDrawElements(GL_TRIANGLE_STRIP, 6, GL_UNSIGNED_INT, (void*)0);
            }
        }
    }
    glBindVertexArray(0);
}
There is a generation algorithm in the constructor. It could be anything: noise, random, etc. The algorithm goes through and sets each element in the blocks array to 1 (meaning: render the block) or 0 (meaning: do not render it).
Problem
How would I go about combining these triangle strips together in order to minimize draw calls? I can think of a few algorithms to find the triangles that should be merged into one draw call, but I am confused as to how to merge them. Do I need to add them to the vertices array and call glBufferData again? Would it be bad to call glBufferData that many times per frame?
I'm not really rendering that many triangles, am I? I think I've heard of people who can easily draw ten thousand triangles with minimal CPU usage (or even millions). So what is wrong with how I am drawing currently?
EDIT
Andon M. Coleman has given me a lot of information in chat. I have now switched over to using instanced arrays; I cannot believe how much of a difference it makes in performance. For a minute I thought Linux's `top` command was malfunctioning. It's _very_ significant: instead of only being able to render, say, 60 triangles, I can render over a million with barely any change in CPU usage.
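For reference, the instanced-array version looks roughly like this (a sketch with assumed names: instanceVBO holds one offset per visible block, and attribute 1 is the per-instance offset consumed by the vertex shader):
// Gather one offset per visible block, once per chunk (not per frame).
std::vector<glm::vec2> offsets;
for (int x = 0; x < 64; x++)
    for (int y = 0; y < 64; y++)
        if (blocks[x][y] == 1)
            offsets.push_back(glm::vec2(x, y));

glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glBufferData(GL_ARRAY_BUFFER, offsets.size() * sizeof(glm::vec2),
             offsets.data(), GL_STATIC_DRAW);
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(glm::vec2), (void*)0);
glVertexAttribDivisor(1, 1); // advance this attribute once per instance

// The whole chunk in a single draw call.
glDrawElementsInstanced(GL_TRIANGLE_STRIP, 6, GL_UNSIGNED_INT, (void*)0,
                        (GLsizei)offsets.size());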

CUDA + OpenGL. Unknown code=4(cudaErrorLaunchFailure) error

I am doing a simple n-body simulation in CUDA, which I am then trying to visualize with OpenGL.
After I have initialized my particle data on the CPU, allocated the respective memory, and transferred that data to the GPU, the program has to enter the following cycle:
1) Compute the forces on each particle (CUDA part)
2) update particle positions (CUDA part)
3) display the particles for this time step (OpenGL part)
4) go back to 1)
I implement the interface between CUDA and OpenGL with the following code:
GLuint dataBufferID;
particle_t* Particles_d;
particle_t* Particles_h;
cudaGraphicsResource *resources[1];
I allocate space in OpenGL's array buffer and register the latter as a cudaGraphicsResource using the following code:
void createVBO()
{
    // create buffer object
    glGenBuffers(1, &dataBufferID);
    glBindBuffer(GL_ARRAY_BUFFER, dataBufferID);
    glBufferData(GL_ARRAY_BUFFER, bufferStride*N*sizeof(float), 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    checkCudaErrors(cudaGraphicsGLRegisterBuffer(resources, dataBufferID, cudaGraphicsMapFlagsNone));
}
Lastly, the program cycle that I described (steps 1 to 4) is realized by the following function update(int):
void update(int value)
{
    // map OpenGL buffer object for writing from CUDA
    float* dataPtr;
    checkCudaErrors(cudaGraphicsMapResources(1, resources, 0));
    size_t num_bytes;
    // get a pointer to that buffer object for manipulation with CUDA!
    checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void**)&dataPtr, &num_bytes, resources[0]));
    // fill the graphics resource with particle position data!
    launch_kernel<<<NUM_BLOCKS,NUM_THREADS>>>(Particles_d, dataPtr, 1);
    // unmap buffer object
    checkCudaErrors(cudaGraphicsUnmapResources(1, resources, 0));
    glutPostRedisplay();
    glutTimerFunc(milisec, update, 0);
}
I compile and run, and I get the following errors:
CUDA error at src/main.cu:390 code=4(cudaErrorLaunchFailure) "cudaGraphicsMapResources(1, resources, 0)"
CUDA error at src/main.cu:392 code=4(cudaErrorLaunchFailure) "cudaGraphicsResourceGetMappedPointer((void **)&dataPtr, &num_bytes,resources[0])"
CUDA error at src/main.cu:397 code=4(cudaErrorLaunchFailure) "cudaGraphicsUnmapResources(1, resources, 0)"
Does anyone know what might be the reason for these errors? Am I supposed to re-create the dataBuffer using createVBO() every time prior to the execution of update(int)?
P.S. Just for more clarity, my kernel function is the following:
__global__ void launch_kernel(particle_t* Particles, float* data, int KernelMode) {
    int i = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    if (KernelMode == 1) {
        // N_d is allocated on device memory
        if (i > N_d)
            return;
        // and update dataBuffer!
        updateX(Particles + i);
        for (int d = 0; d < DIM_d; d++) {
            data[i * bufferStride_d + d] = Particles[i].p[d]; // update the new coordinate positions in the data buffer!
        }
        // fill in also the RGB data and the radius. In general THIS IS NOT NECESSARY!! NEED TO PERFORM ONCE! REFACTOR!!!
        data[i * bufferStride_d + DIM_d]     = Particles[i].r;
        data[i * bufferStride_d + DIM_d + 1] = Particles[i].g;
        data[i * bufferStride_d + DIM_d + 2] = Particles[i].b;
        data[i * bufferStride_d + DIM_d + 3] = Particles[i].radius;
    } else {
        // if KernelMode == 2 then update Y
        float* Fold = new float[DIM_d];
        for (int d = 0; d < DIM_d; d++)
            Fold[d] = Particles[i].force[d];
        // of course in parallel :)
        computeForces(Particles, i);
        updateV(Particles + i, Fold);
        delete[] Fold;
    }
    // in either case wait for all threads to finish!
    __syncthreads();
}
As I mentioned in one of the comments above, it turned out that I had got the compute capability compiler option wrong. I ran cuda-memcheck and saw that the CUDA API launch was failing. After I found the right compiler options, everything worked like a charm.
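For anyone debugging the same symptom: checking the launch itself right after the <<<...>>> call reports the failure at the launch site instead of at the next, unrelated CUDA call (a sketch using the same checkCudaErrors helper as above), and nvcc's -arch/-gencode options need to match the GPU's compute capability:
launch_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(Particles_d, dataPtr, 1);
// Launch-configuration errors (e.g. a binary built for the wrong compute
// capability) surface here rather than at a later API call.
checkCudaErrors(cudaGetLastError());
// Errors raised while the kernel executes surface after synchronizing.
checkCudaErrors(cudaDeviceSynchronize());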