SDL_GL_SwapWindow bad performance - c++

I did some performance testing and came up with this:
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
for (U32 i = 0; i < objectList.length(); ++i)
{
    PC d("draw");
    VoxelObject& obj = *objectList[i];
    glBindVertexArray(obj.vao);
    tmpM = usedView->projection * usedView->transform * obj.transform;
    glUniformMatrix4fv(shader.modelViewMatrixLoc, 1, GL_FALSE, tmpM.data());
    //glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, typesheet.tbo);
    glUniform1i(shader.typesheetLoc, 0);
    glDrawArrays(GL_TRIANGLES, 0, VoxelObject::VERTICES_PER_BOX * obj.getNumBoxes());
    d.out(); // 2 calls, 0.000085s and 0.000043s each
}
PC swap("swap");
SDL_GL_SwapWindow(mainWindow); // 1 call, 0.007823s
swap.out();
The call to SDL_GL_SwapWindow(mainWindow) is taking 200 times longer than the draw calls! To my understanding, all that function was supposed to do was swap buffers. That would mean the time it takes to swap should scale with the screen size, right? No, it scales with the amount of geometry... I did some searching online; I have double buffering enabled and vsync is turned off. I am stumped.

Your OpenGL driver is most likely deferring the actual rendering work.
That means calls to glDrawArrays and friends don't draw anything immediately. Instead, they buffer all the information required to perform the operation later on.
The actual rendering happens inside SDL_GL_SwapWindow.
This behavior is typical these days, because drivers want to avoid synchronizing the CPU and the GPU as much as possible.
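If you want your timings to reflect the real GPU cost of the work rather than just the cost of queueing it, you can either drain the command queue before stopping the timer, or use an OpenGL timer query. A minimal sketch, reusing the question's PC timer class; vertexCount is a placeholder:

// Option 1: crude profiling - block until the GPU has finished, so the CPU
// timer includes GPU execution time (don't ship this; it kills parallelism).
PC d("draw+gpu");
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
glFinish(); // waits for every queued GL command to complete
d.out();

// Option 2: GPU timer query - measures GPU time without stalling each frame.
GLuint query;
glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
glEndQuery(GL_TIME_ELAPSED);
// ...later, once the result is available (e.g. a frame or two afterwards)...
GLuint64 gpuTimeNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuTimeNs); // GPU time in nanoseconds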

Related

OpenGL, glMapNamedBuffer takes a long time

I've been writing an OpenGL program that generates vertices on the GPU using compute shaders. The problem is that I need to read back, on the CPU, the number of vertices written to a buffer by one compute shader dispatch, so that I can allocate a buffer of the right size for the next dispatch to fill with vertices.
/*
* Stage 1- Populate the 3d texture with voxel values
*/
_EvaluateVoxels.Use();
glActiveTexture(GL_TEXTURE0);
GLPrintErrors("glActiveTexture(GL_TEXTURE0);");
glBindTexture(GL_TEXTURE_3D, _RandomSeedTexture);
glBindImageTexture(2, _VoxelValuesTexture, 0, GL_TRUE, 0, GL_READ_WRITE, GL_R32F); // layer is a GLint; it is ignored here because layered == GL_TRUE
_EvaluateVoxels.SetVec3("CellSize", voxelCubeDims);
SetMetaBalls(metaballs);
_EvaluateVoxels.SetVec3("StartPos", chunkPosLL);
glDispatchCompute(voxelDim.x + 1, voxelDim.y + 1, voxelDim.z + 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
/*
* Stage 2 - Calculate the marching cube's case for each cube of 8 voxels,
* listing those that contain polygons and counting the no of vertices that will be produced
*/
_GetNonEmptyVoxels.Use();
_GetNonEmptyVoxels.SetFloat("IsoLevel", isoValue);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, _IntermediateDataSSBO);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, _AtomicCountersBuffer);
glDispatchCompute(voxelDim.x, voxelDim.y, voxelDim.z);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT);
//printStage2(_IntermediateDataSSBO, true);
_StopWatch.StopTimer("stage2");
_StopWatch.StartTimer("getvertexcounter");
// this line takes a long time
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);
unsigned int vertex_counter = vals[1];
unsigned int index_counter = vals[0];
vals[0] = 0;
vals[1] = 0;
glUnmapNamedBuffer(_AtomicCountersBuffer);
The image below shows the time in milliseconds that each stage of the code takes to run; "timer Evaluate" refers to the method as a whole, i.e. the sum total of the previous stages, and "getvertexcounter" refers only to the mapping, reading, and unmapping of the buffer containing the number of vertices. Please see the code for more detail.
I've found this to be by far the slowest stage in the process, and I gather it has something to do with the asynchronous nature of the communication between OpenGL and the GPU, and the need to synchronise data written by the compute shader so that it can be read by the CPU. My question is this: is this delay avoidable? I don't think the overall approach is flawed, because I know someone else has implemented the algorithm in a similar way, albeit using DirectX (I think).
You can find my code at https://github.com/JimMarshall35/Marching-cubes-cpp/tree/main/MarchingCubes , the code in question is in the file ComputeShaderMarcher.cpp and the method unsigned int ComputeShaderMarcher::GenerateMesh(const glm::vec3& chunkPosLL, const glm::vec3& chunkDim, const glm::ivec3& voxelDim, float isoValue, GLuint VBO)
In order to access data from a buffer that you have had OpenGL write to, the CPU must halt execution until the GPU has actually written that data. Whatever mechanism you use to access this data (glMapBufferRange, glGetBufferSubData, etc.), it must wait until the GPU has finished generating the data.
So don't try to access GPU-generated data until you're sure the GPU has actually generated it (or you have absolutely nothing better to do on the CPU than wait). Use fence sync objects to test whether the GPU has finished executing past a certain point.
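A minimal sketch of that fence approach, reusing the question's _AtomicCountersBuffer and voxelDim names; the polling loop and the extra GL_BUFFER_UPDATE_BARRIER_BIT (which makes shader writes visible to glMapNamedBuffer reads) are one possible structure, not the only one:

// Issue the dispatch, then drop a fence into the command stream.
glDispatchCompute(voxelDim.x, voxelDim.y, voxelDim.z);
glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT | GL_BUFFER_UPDATE_BARRIER_BIT);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Do other useful CPU work here instead of blocking right away.

// Later: poll the fence and only map once the GPU has passed it.
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
while (status == GL_TIMEOUT_EXPIRED) {
    // ...more CPU work, then poll again...
    status = glClientWaitSync(fence, 0, 0);
}
glDeleteSync(fence);

// The map should no longer stall.
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);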

OpenGL large SSBO hangs when reading

I have multiple SSBOs (4 SSBOs) of 400×400×100 ints each, and my fragment shader updates the values in these SSBOs. I do this by calling glDrawElements, and I read the data back after the draw call by first binding the specific SSBO and then calling glMapBuffer and casting the returned pointer to an int pointer.
The GPU does heavy processing (loops with 10000 iterations) and updates the SSBO. I have a print statement after the glDrawElements call, which is shown on screen, and a print statement after the bind call, which is also displayed, but the glMapBuffer call takes forever and hangs the system.
In the Windows Task Manager, the GPU is idle most of the time and only the CPU is used. I think that's because the GPU is only busy during the draw call.
Also, is my understanding correct that when I call glMapBuffer, only the bound SSBO is transferred?
Do you guys have any suggestion as to what might be causing this issue?
I tried using glMapBufferRange, which caused a similar problem.
glDrawElements(GL_TRIANGLES, NUM_OF_TRIANGLES, GL_UNSIGNED_INT, 0);
std::cout << "done rendering" << std::endl; // prints this out
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, 0);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, SSBO);
std::cout << "done binding ssbo" << std::endl; // prints this out
GLint *ptr;
ptr = (GLint *) glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY); // hangs here

OpenGL glReadPixels Performance

I am trying to implement auto-exposure for HDR tone mapping. To reduce the cost of finding the average brightness of my scene, I downsample first, but I seem to have hit a choke point with glReadPixels. Here is my setup:
1: I create a downsampled FBO to reduce the cost of reading with glReadPixels, using only the GL_RED channel in GL_BYTE format.
private void CreateDownSampleExposure() {
    DownFrameBuffer = glGenFramebuffers();
    DownTexture = GL11.glGenTextures();
    glBindFramebuffer(GL_FRAMEBUFFER, DownFrameBuffer);
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, DownTexture);
    GL11.glTexImage2D(GL11.GL_TEXTURE_2D, 0, GL11.GL_RED, 1600/8, 1200/8,
            0, GL11.GL_RED, GL11.GL_BYTE, (ByteBuffer) null);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
            GL11.GL_TEXTURE_2D, DownTexture, 0);
    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        System.err.println("error");
    } else {
        System.err.println("success");
    }
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
2: Setting up the ByteBuffer and reading back the texture of the FBO created above.
Setup() {
    byte[] testByte = new byte[1600/8 * 1000/8];
    ByteBuffer testByteBuffer = BufferUtils.createByteBuffer(testByte.length);
    testByteBuffer.put(testByte);
    testByteBuffer.flip();
}

MainLoop() {
    // Render scene and store result into downSampledFBO texture
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, DeferredFBO.getDownTexture());
    //GL11.glGetTexImage(GL11.GL_TEXTURE_2D, 0, GL11.GL_RED, GL11.GL_BYTE,
    //        testByteBuffer); <- This is slower than readPixels.
    GL11.glReadPixels(0, 0, DisplayManager.Width/8, DisplayManager.Height/8,
            GL11.GL_RED, GL11.GL_BYTE, testByteBuffer);
    int x = 0;
    for (int i = 0; i < testByteBuffer.capacity(); i++) {
        x += testByteBuffer.get(i);
    }
    System.out.println(x); // <- Print out accumulated value of brightness.
}
// Adjust exposure depending on brightness.
The problem is, even if I downsample my FBO texture by a factor of 100, so that glReadPixels reads only 16x10 pixels, there is little to no performance gain. There is a substantial gain compared to no downsampling at all, but past dividing the width and height by about 8 it seems to fall off. It looks like there is a huge overhead just in calling this function. Is there something I am doing incorrectly or not considering when calling glReadPixels?
glReadPixels is slow because the CPU must wait until the GPU has finished all of its rendering before it can give you the results. The dreaded sync point.
One way to make glReadPixels fast is to use some sort of double/triple buffering scheme, so that you only call glReadPixels on render-to-textures that you expect the GPU has already finished with. This is only viable if waiting a couple of frames before receiving the result of glReadPixels is acceptable in your application. For example, in a video game the latency could be justified as a simulation of the pupil's response time to a change in lighting conditions.
However, for your particular tone-mapping example, presumably you want to calculate the average brightness only to feed that information back into the GPU for another rendering pass. Instead of glReadPixels, calculate the average by copying your image to successively half-sized render targets with linear filtering (a box filter), until you're down to a 1x1 target.
That 1x1 target is now a texture containing your average brightness, and you can use that texture in your tone-mapping rendering pass. No sync points.
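A minimal sketch of that reduction idea, using glGenerateMipmap as a shortcut for the chain of half-sized linearly filtered targets (lumTexture, lumTex and lastMipLevel are placeholder names, not from the question; most drivers generate mips with a simple 2x2 box filter, which is exactly the average we want):

// After rendering the scene's luminance into lumTexture, build its mip chain:
// each level averages 2x2 texels of the level above, so the final 1x1 level
// holds the mean brightness of the whole image.
glBindTexture(GL_TEXTURE_2D, lumTexture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);
glGenerateMipmap(GL_TEXTURE_2D);

// In the tone-mapping pass, sample that last level on the GPU - no
// glReadPixels, no CPU/GPU sync point. GLSL fragment shader snippet:
//
//   uniform sampler2D lumTex;
//   uniform float lastMipLevel; // = log2(max(width, height))
//   float avgLum = textureLod(lumTex, vec2(0.5), lastMipLevel).r;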

OpenGL Merging Vertex Data to Minimize Draw Calls

Background
2D "Infinite" World separated into chunks
One VAO (& VBO/EBO) per chunk
Nested for loop in chunk render; one draw call per block.
Code
void Chunk::Render(/* ... */) {
    glBindVertexArray(vao);
    for (int x = 0; x < 64; x++) {
        for (int y = 0; y < 64; y++) {
            if (blocks[x][y] == 1) {
                /* ... Uniforms ... */
                glDrawElements(GL_TRIANGLE_STRIP, 6, GL_UNSIGNED_INT, (void*)0);
            }
        }
    }
    glBindVertexArray(0);
}
There is a generation algorithm in the constructor. This could be anything: noise, random, etc. The algorithm goes through and sets each element in the blocks array to 1 (meaning: render the block) or 0 (meaning: do not render).
Problem
How would I go about combining these triangle strips to minimize draw calls? I can think of a few algorithms for finding the triangles that should be merged into one draw call, but I am confused about how to actually merge them. Do I need to add them to the vertices array and call glBufferData again? Would it be bad to call glBufferData so many times per frame?
I'm not really rendering that many triangles, am I? I think I've heard of people who can easily draw ten thousand triangles with minimal CPU usage (or even millions). So what is wrong with how I am drawing currently?
EDIT
Andon M. Coleman has given me a lot of information in chat. I have now switched over to using instanced arrays; I cannot believe how much of a difference it makes in performance. For a minute I thought Linux's `top` command was malfunctioning. It's very significant: instead of only being able to render, say, 60 triangles, I can render over a million with barely any change in CPU usage.
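For reference, a minimal sketch of the instanced-array approach mentioned in the edit, assuming the chunk's existing vao/EBO holds one block quad and that the vertex shader adds a per-instance offset to each vertex; blockOffsets, instanceVBO and attribute index 1 are illustrative choices, not from the question (needs <vector> and GLM):

// Gather one vec2 offset per visible block in the chunk.
std::vector<glm::vec2> blockOffsets;
for (int x = 0; x < 64; x++)
    for (int y = 0; y < 64; y++)
        if (blocks[x][y] == 1)
            blockOffsets.push_back(glm::vec2(x, y));

// Upload the per-instance data once (or whenever the chunk changes).
GLuint instanceVBO;
glGenBuffers(1, &instanceVBO);
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glBufferData(GL_ARRAY_BUFFER, blockOffsets.size() * sizeof(glm::vec2),
             blockOffsets.data(), GL_STATIC_DRAW);

// Hook it into the chunk's VAO as a per-instance attribute.
glBindVertexArray(vao);
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(glm::vec2), (void*)0);
glVertexAttribDivisor(1, 1); // advance this attribute once per instance

// One draw call now renders every visible block in the chunk.
glDrawElementsInstanced(GL_TRIANGLE_STRIP, 6, GL_UNSIGNED_INT, (void*)0,
                        (GLsizei)blockOffsets.size());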

Simple curiosity about performance using OpenGL and GLSL

I am developing a small 3D engine using OpenGL and GLSL.
Here's part of the rendering code:
void video::RenderBatch::Render(void)
{
    type::EffectPtr pShaderEffect = EffectManager::GetSingleton()
        .FindEffectByName(this->m_pMaterial->GetAssocEffectName());
    pShaderEffect->Bind();
    {
        // VERTEX ATTRIBUTE LOCATIONS.
        {
            pShaderEffect->BindAttribLocation(scene::VERTEX_POSITION, "VertexPosition");
            pShaderEffect->BindAttribLocation(scene::VERTEX_TEXTURE, "VertexTexture");
            pShaderEffect->BindAttribLocation(scene::VERTEX_NORMAL, "VertexNormal");
        }
        // SEND MATRIX UNIFORMS.
        {
            glm::mat3 normalMatrix = glm::mat3(glm::vec3(this->m_ModelViewMatrix[0]),
                glm::vec3(this->m_ModelViewMatrix[1]), glm::vec3(this->m_ModelViewMatrix[2]));
            pShaderEffect->SetUniform("ModelViewProjMatrix", this->m_ModelViewProjMatrix);
            pShaderEffect->SetUniform("ModelViewMatrix", this->m_ModelViewMatrix);
            pShaderEffect->SetUniform("NormalMatrix", normalMatrix);
        }
        this->SendLightUniforms(pShaderEffect); // LIGHT MATERIALS TO BE SENT JUST ONCE
        pShaderEffect->SendMaterialUniforms(    // SEND MATERIALS IF CHANGED
            this->m_pMaterial->GetName());
        this->m_pVertexArray->Lock();
        {
            this->m_pIndexBuffer->Lock();
            {
                RenderData renderData = this->GetVisibleGeometryData();
                glMultiDrawElements(GL_TRIANGLES, (GLsizei*)&renderData.count[0], GL_UNSIGNED_INT,
                                    (const GLvoid**)&renderData.indices[0], renderData.count.size());
            }
            this->m_pIndexBuffer->Unlock();
        }
        this->m_pVertexArray->Unlock();
    }
    pShaderEffect->Release();
}
I noticed that calling the SetUniform function causes a huge loss of FPS (from more than 1000 FPS without it to around 65 FPS with it!). Just ONE call to this function is enough!
Here's the code of SetUniform (for a 4x4 matrix):
void video::IEffectBase::SetUniform(char const *pName, glm::mat4 mat)
{
    int location = glGetUniformLocation(this->m_Handle, pName);
    if (location >= 0)
        glUniformMatrix4fv(location, 1, GL_FALSE, glm::value_ptr(mat));
}
In fact, a single call to glGetUniformLocation or to glUniformMatrix4fv is enough to cause such a loss of FPS. Is it normal to go from over 1000 FPS to 65 FPS with a single call to this function? Buffer binding and shader program binding don't have any such effect! (If I comment out all the SetUniform calls, I still get more than 1000 FPS, even with all the bindings/state changes!)
So, to sum up the situation: all the functions I need to send uniform information to the shader program (matrices, material data, and so on) seem to have a huge impact on the frame rate. Yet in this example my scene is composed of a single cube mesh! Nothing difficult for the GPU to render!
But I don't think the problem comes from the GPU, because my program's impact on it is negligible (according to GPUShark):
Only 6%! And just displaying the window (without the geometry) is enough to reach 6%! So rendering my cube has almost no impact on the GPU. I therefore think the problem comes from CPU-to-GPU data transfer... I expect some loss of performance when using these functions, but going from more than 1000 FPS to 65 FPS is incredible! And just to draw simple geometry!
Is there a way to get better performance, or is it normal to have such a loss of FPS with this technique of sending data?
What do you think about that?
Thank you very much for your help!
Don't call glGetUniformLocation every time you need to set a uniform's value. Uniform locations don't change for a given shader (unless you recompile it), so look up the uniforms once after compiling the shader and save the location values for use in your Render function.
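A minimal sketch of that caching idea, using a map keyed by uniform name; the Effect class, its members, and the CacheUniform helper are illustrative, not the question's actual engine code (GL function declarations are assumed to come from your usual loader):

#include <string>
#include <unordered_map>
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

class Effect
{
public:
    // Call once per uniform, right after the program has been linked.
    void CacheUniform(const std::string& name)
    {
        m_UniformCache[name] = glGetUniformLocation(m_Handle, name.c_str());
    }

    // Per-frame path: a cheap map lookup instead of a glGetUniformLocation call.
    void SetUniform(const std::string& name, const glm::mat4& mat) const
    {
        auto it = m_UniformCache.find(name);
        if (it != m_UniformCache.end() && it->second >= 0)
            glUniformMatrix4fv(it->second, 1, GL_FALSE, glm::value_ptr(mat));
    }

private:
    GLuint m_Handle = 0; // linked shader program handle
    std::unordered_map<std::string, GLint> m_UniformCache;
};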