OpenGL Merging Vertex Data to Minimize Draw Calls - c++

Background
2D "Infinite" World separated into chunks
One VAO (& VBO/EBO) per chunk
Nested for loop in chunk render; one draw call per block.
Code
void Chunk::Render(/* ... */) {
glBindVertexArray(vao);
for (int x = 0; x < 64; x++) {
for (int y = 0; y < 64; y++) {
if (blocks[x][y] == 1) {
/* ... Uniforms ... */
glDrawElements(GL_TRIANGLE_STRIP, 6, GL_UNSIGNED_INT, (void*)0);
}
}
}
glBindVertexArray(0);
}
There is a generation algorithm in the constructor. This could be anything: noise, random, etc. The algorithm goes through and sets an element in the blocks array to 1 (meaning: render block) or 0 (meaning: do not render)
Problem
How would I go about combining these triangle strips together in order to minimize draw calls? I can think of a few algorithms to find the triangles that should be merged into one draw call, but I am confused as to how to actually merge them. Do I need to add them to the vertices array and call glBufferData again? Would it be bad to call glBufferData so many times per frame?
I'm not really rendering that many triangles, am I? I think I've heard of people who can easily draw ten-thousand triangles with minimal CPU usage (or.. millions even). So what is wrong with how I am drawing currently?
EDIT
Andon M. Coleman has given me a lot of information in the chat. I have now switched over to using instanced arrays; I cannot believe how much of a difference it makes in performance. For a minute I thought Linux's `top` command was malfunctioning. It's very significant. Instead of only being able to render, say, 60 triangles, I can render over a million with barely any change in CPU usage.
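For readers following along, here is a minimal sketch of the CPU side of the instanced-array approach mentioned in the edit. It is not the asker's actual code: the names `BlockGrid`, `InstanceOffset`, `collectInstanceOffsets` and `instanceCount` are hypothetical, and the grid is assumed to be the 64x64 `blocks` array from the question. The idea is to collect one small per-instance record for every visible block once, instead of setting uniforms and issuing a draw call per block.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kChunkSize = 64;  // matches the 64x64 loops in Chunk::Render
using BlockGrid = std::array<std::array<int, kChunkSize>, kChunkSize>;

// One record per visible block, intended to be read as a per-instance
// vertex attribute (hypothetical layout).
struct InstanceOffset {
    float x;
    float y;
};

// Collects the grid position of every block marked 1 (render).
std::vector<InstanceOffset> collectInstanceOffsets(const BlockGrid& blocks) {
    std::vector<InstanceOffset> offsets;
    for (int x = 0; x < kChunkSize; ++x) {
        for (int y = 0; y < kChunkSize; ++y) {
            if (blocks[x][y] == 1) {
                offsets.push_back({static_cast<float>(x), static_cast<float>(y)});
            }
        }
    }
    return offsets;
}

// Instance count as the signed integer an instanced draw call expects.
int instanceCount(const std::vector<InstanceOffset>& offsets) {
    assert(offsets.size() <= static_cast<std::size_t>(kChunkSize) * kChunkSize);
    return static_cast<int>(offsets.size());
}
```

The resulting array would typically be uploaded once when the chunk is generated (not every frame), bound as an attribute with glVertexAttribDivisor set to 1, and drawn with a single glDrawElementsInstanced call per chunk.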

Related

OpenGL, glMapNamedBuffer takes a long time

I've been writing an OpenGL program that generates vertices on the GPU using compute shaders. The problem is that I need to read back, on the CPU, the number of vertices from a buffer written by one compute shader dispatch, so that I can allocate a buffer of the right size for the next compute shader dispatch to fill with vertices.
/*
* Stage 1- Populate the 3d texture with voxel values
*/
_EvaluateVoxels.Use();
glActiveTexture(GL_TEXTURE0);
GLPrintErrors("glActiveTexture(GL_TEXTURE0);");
glBindTexture(GL_TEXTURE_3D, _RandomSeedTexture);
glBindImageTexture(2, _VoxelValuesTexture, 0, GL_TRUE, NULL, GL_READ_WRITE, GL_R32F);
_EvaluateVoxels.SetVec3("CellSize", voxelCubeDims);
SetMetaBalls(metaballs);
_EvaluateVoxels.SetVec3("StartPos", chunkPosLL);
glDispatchCompute(voxelDim.x + 1, voxelDim.y + 1, voxelDim.z + 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
/*
* Stage 2 - Calculate the marching cube's case for each cube of 8 voxels,
* listing those that contain polygons and counting the no of vertices that will be produced
*/
_GetNonEmptyVoxels.Use();
_GetNonEmptyVoxels.SetFloat("IsoLevel", isoValue);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, _IntermediateDataSSBO);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, _AtomicCountersBuffer);
glDispatchCompute(voxelDim.x, voxelDim.y, voxelDim.z);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT);
//printStage2(_IntermediateDataSSBO, true);
_StopWatch.StopTimer("stage2");
_StopWatch.StartTimer("getvertexcounter");
// this line takes a long time
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);
unsigned int vertex_counter = vals[1];
unsigned int index_counter = vals[0];
vals[0] = 0;
vals[1] = 0;
glUnmapNamedBuffer(_AtomicCountersBuffer);
The image below shows the time in milliseconds that each stage of the code takes to run. "timer Evaluate" refers to the method as a whole, i.e. the sum of the previous stages. "getvertexcounter" refers only to the mapping, reading and unmapping of the buffer containing the number of vertices. Please see the code for more detail.
I've found this to be by far the slowest stage in the process, and I gather it has something to do with the asynchronous nature of the communication between OpenGL and the GPU and the need to synchronise data that was written by the compute shader so it can be read by the CPU. My question is this: is this delay avoidable? I don't think the overall approach is flawed, because I know that someone else has implemented the algorithm in a similar way, albeit using DirectX (I think).
You can find my code at https://github.com/JimMarshall35/Marching-cubes-cpp/tree/main/MarchingCubes , the code in question is in the file ComputeShaderMarcher.cpp and the method unsigned int ComputeShaderMarcher::GenerateMesh(const glm::vec3& chunkPosLL, const glm::vec3& chunkDim, const glm::ivec3& voxelDim, float isoValue, GLuint VBO)
In order to access data from a buffer that you have had OpenGL write some data to, the CPU must halt execution until the GPU has actually written that data. Whatever process you use to access this data (glMapBufferRange, glGetBufferSubData, etc), that process must halt until the GPU has finished generating the data.
So don't try to access GPU-generated data until you're sure the GPU has actually generated it (or you have absolutely nothing better to do on the CPU than wait). Use fence sync objects to test whether the GPU has finished executing past a certain point.

How to draw a terrain model efficiently from Esri Grid (osg)?

I have many Esri Grid files (https://en.wikipedia.org/wiki/Esri_grid#ASCII) and I would like to render them in 3D without losing precision, I am using OpenSceneGraph.
The problem is that these grids are around 1000x1000 (or more) points, so when I extract the vertices and then compute the triangles to create the geometry, I end up with millions of them and interaction with the scene is impossible (the frame rate drops to 0).
I've tried several approaches:
Triangle list
Basically, as I read the file, I fill an array with 3 vertices per triangle (this leads to duplication);
osg::ref_ptr<osg::Geode> l_pGeodeSurface = new osg::Geode;
osg::ref_ptr<osg::Geometry> l_pGeometrySurface = new osg::Geometry;
osg::ref_ptr<osg::Vec3Array> l_pvTrianglePoints = new osg::Vec3Array;
osg::ref_ptr<osg::Vec3Array> l_pvOriginalPoints = new osg::Vec3Array;
... // Read the file and fill l_pvOriginalPoints
for(*triangle inside the file*)
{
... // Compute correct triangle indices (l_iP1, l_iP2, l_iP3)
// Push triangle vertices inside the array
l_pvTrianglePoints->push_back(l_pvOriginalPoints->at(l_iP1));
l_pvTrianglePoints->push_back(l_pvOriginalPoints->at(l_iP2));
l_pvTrianglePoints->push_back(l_pvOriginalPoints->at(l_iP3));
}
l_pGeometrySurface->setVertexArray(l_pvTrianglePoints);
// osg::DrawArrays takes (mode, first, count); a fourth argument would be an instance count
l_pGeometrySurface->addPrimitiveSet(new osg::DrawArrays(GL_TRIANGLES, 0, l_pvTrianglePoints->size()));
Indexed triangle list
Same as before, but the array contains every vertex just once and I create a second array of indices (basically I tell osg how to build the triangles, with no duplication).
osg::ref_ptr<osg::Geode> l_pGeodeSurface = new osg::Geode;
osg::ref_ptr<osg::Geometry> l_pGeometrySurface = new osg::Geometry;
osg::ref_ptr<osg::DrawElementsUInt> l_pIndices = new osg::DrawElementsUInt(osg::PrimitiveSet::TRIANGLES, *number of indices*);
osg::ref_ptr<osg::Vec3Array> l_pvOriginalPoints = new osg::Vec3Array;
... // Read the file and fill l_pvOriginalPoints
for(i = 0; i < *number of indices*; i += 3) // three indices per triangle
{
    ... // Compute correct triangle indices (l_iP1, l_iP2, l_iP3)
    // Store the vertex indices of this triangle
    l_pIndices->at(i) = l_iP1;
    l_pIndices->at(i+1) = l_iP2;
    l_pIndices->at(i+2) = l_iP3;
}
l_pGeometrySurface->setVertexArray(l_pvOriginalPoints );
l_pGeometrySurface->addPrimitiveSet(l_pIndices.get());
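As an illustration of the "compute correct triangle indices" step for a regular grid, here is a minimal sketch. It assumes the points are stored row by row (row-major) with `cols` points per row; `buildGridIndices` is a hypothetical helper, not part of OpenSceneGraph, and the winding order may need flipping depending on how the grid's rows map to world coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds an indexed triangle list for a rows x cols grid of points stored
// row-major. Each grid cell is split into two triangles.
std::vector<std::uint32_t> buildGridIndices(std::uint32_t rows, std::uint32_t cols) {
    assert(rows >= 2 && cols >= 2);
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(rows - 1) * (cols - 1) * 6);
    for (std::uint32_t r = 0; r + 1 < rows; ++r) {
        for (std::uint32_t c = 0; c + 1 < cols; ++c) {
            const std::uint32_t i0 = r * cols + c;  // this row, this column
            const std::uint32_t i1 = i0 + 1;        // this row, next column
            const std::uint32_t i2 = i0 + cols;     // next row, this column
            const std::uint32_t i3 = i2 + 1;        // next row, next column
            // First triangle of the cell
            indices.push_back(i0);
            indices.push_back(i2);
            indices.push_back(i1);
            // Second triangle of the cell
            indices.push_back(i1);
            indices.push_back(i2);
            indices.push_back(i3);
        }
    }
    return indices;
}
```

For a 1000x1000 grid this yields 999 * 999 * 6 indices over one million shared vertices, so the index buffer is the larger of the two arrays and 32-bit indices are required.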
Instancing
This was a bit of an experiment. Since I've never used shaders, I thought I could instance a single triangle and then manipulate its coordinates in a vertex shader for every triangle in my scene, using transformation matrices (passing the matrices as a uniform array, one per triangle). I ended up with too many uniforms even with a 20x20 grid.
I used these links as a reference:
https://learnopengl.com/Advanced-OpenGL/Instancing,
https://books.google.it/books?id=x_RkEBIJeFQC&pg=PT265&lpg=PT265&dq=osg+instanced+geometry&source=bl&ots=M8ii8zn8w7&sig=ACfU3U0_92Z5EGCyOgbfGweny4KIUfqU8w&hl=en&sa=X&ved=2ahUKEwj-7JD0nq7qAhUXxMQBHcLaAiUQ6AEwAnoECAkQAQ#v=onepage&q=osg%20instanced%20geometry&f=false
None of the above solved my issue. What else can I try? Am I missing something in terms of rendering techniques? I thought it was a fairly simple task, but I'm kind of stuck.
I feel like you should consider taking a step back. If you're visualizing GIS-based terrain data, osgEarth is really designed for doing this and has fairly efficient LOD tools for large terrains. Do you need the data always represented at maximum full LOD or are you looking for dynamic LOD to improve frame rate?
Depending on your goals and requirements, you might want to look at some more advanced terrain rendering techniques, like heightfield tracing. If the terrain is always static, you can precompute quadtrees and signed distance functions and trace against the heightfield.

Can you modify a uniform from within the shader? If so, how?

So I wanted to store all my meshes in one large VBO. The problem is, how do you have just one draw call, but let every mesh have its own model-to-world matrix?
My idea was to submit an array of matrices to a uniform before drawing. In the VBO I would make the color of every first vertex of a mesh negative (so I'd be using the sign bit to check whether a vertex was the first of a mesh).
Okay, so I can detect when a new mesh has started and I have an array of matrices ready and probably a uniform called 'index'. But how do I increase this index by one every time I encounter a new mesh?
Can you modify a uniform from within the shader? If so, how?
Can you modify a uniform from within the shader?
If you could, it wouldn't be uniform anymore, would it?
Furthermore, what you're wanting to do cannot be done even with Image Load/Store or SSBOs, both of which allow shaders to write data. It won't work because vertex shader invocations are not required to be executed sequentially. Many happen at the same time, and there's no way for any shader invocation to know that it will happen "after" the "first vertex" in a mesh.
The simplest way to deal with this is the obvious solution. Render each mesh individually, but set the uniforms for each mesh before each draw call. Without changing buffers between draws, of course. Uniform changes, while not exactly cheap, aren't the most expensive state changes that exist.
There are more complicated drawing methods that could allow you more performance. But that form is adequate for most needs. You've already done the hard part: you removed the need for any state change (textures, buffers, vertex formats, etc) except uniform state.
There are two approaches to minimizing draw calls: instancing and batching. Instancing lets you draw multiple copies of the same mesh in one draw call, but it depends on the API (it is available from OpenGL 3.1). Batching is similar to instancing but lets you draw different meshes. Both approaches have a restriction: the meshes must share the same materials and shaders.
If you want to draw different meshes from one VBO, instancing is not an option. Batching requires keeping all meshes in one 'big' VBO with the world transform already applied. That is not a problem with static meshes, but it is awkward with animated ones. Here is some pseudocode for a batching implementation:
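// Pseudocode: Copy, Commit, VBO, IBO, VertexStream and allInstances stand for your own buffer helpers.
struct SGeometry
{
    uint64_t offsetVB;      // first vertex of this mesh in the shared VBO
    uint64_t offsetIB;      // first index of this mesh in the shared IBO
    uint64_t sizeVB;        // number of vertices
    uint64_t sizeIB;        // number of indices
    glm::mat4 oldTransform; // transform currently baked into the vertices
    glm::mat4 transform;
};
std::vector<SGeometry> cachedGeometries;
...
void CommitInstances()
{
    uint64_t vertexOffset = 0;
    uint64_t indexOffset = 0;
    for (auto* instance : allInstances)
    {
        Copy(instance->Vertexes(), VBO); // appended at vertexOffset
        for (uint64_t i = 0; i < instance->Indices().size(); ++i)
        {
            // Indices are local to the mesh, so shift them by the number
            // of vertices already stored in the shared VBO.
            IBO[indexOffset + i] = instance->Indices()[i] + vertexOffset;
        }
        cachedGeometries.push_back({vertexOffset, indexOffset,
            instance->Vertexes().size(), instance->Indices().size(),
            glm::mat4(1.0f), glm::mat4(1.0f)});
        vertexOffset += instance->Vertexes().size();
        indexOffset += instance->Indices().size();
    }
    Commit(VBO);
    Commit(IBO);
}
void ApplyTransform(const glm::mat4& modelMatrix, uint64_t instanceId)
{
    SGeometry& geom = cachedGeometries[instanceId];
    glm::mat4 inverseOldTransform = glm::inverse(geom.oldTransform);
    VertexStream& stream = VBO->GetStream(Position, geom.offsetVB);
    for (uint64_t i = 0; i < geom.sizeVB; ++i)
    {
        glm::vec3 pos = stream.Get(i);
        // Undo the previously baked transform before applying the new one
        pos = glm::vec3(inverseOldTransform * glm::vec4(pos, 1.0f));
        pos = glm::vec3(modelMatrix * glm::vec4(pos, 1.0f));
        stream.Set(i, pos);
    }
    // Remember what is now baked into the vertices
    geom.oldTransform = modelMatrix;
    geom.transform = modelMatrix;
    // .. Apply normal transformation
}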
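The merging step of the pseudocode above can also be shown as a small self-contained sketch. Everything here is hypothetical illustration (the `Mesh`, `Batch`, `BatchEntry` types and `buildBatch`), and a plain translation stands in for a full model matrix to keep it free of external libraries. The point it demonstrates is that each mesh's indices must be shifted by the number of vertices already in the merged buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical CPU-side mesh: positions plus indices local to this mesh.
struct Mesh {
    std::vector<Vec3> positions;
    std::vector<std::uint32_t> indices;
    Vec3 translation{0.0f, 0.0f, 0.0f};  // stand-in for a full model matrix
};

// Where one mesh landed inside the merged buffers.
struct BatchEntry {
    std::size_t firstVertex;
    std::size_t firstIndex;
    std::size_t vertexCount;
    std::size_t indexCount;
};

struct Batch {
    std::vector<Vec3> vertices;         // contents for the single VBO
    std::vector<std::uint32_t> indices; // contents for the single IBO
    std::vector<BatchEntry> entries;
};

// Merges all meshes into one vertex array and one index array, baking the
// per-mesh translation into the vertices.
Batch buildBatch(const std::vector<Mesh>& meshes) {
    Batch batch;
    for (const Mesh& mesh : meshes) {
        const std::size_t firstVertex = batch.vertices.size();
        const std::size_t firstIndex = batch.indices.size();
        for (const Vec3& p : mesh.positions) {
            batch.vertices.push_back({p.x + mesh.translation.x,
                                      p.y + mesh.translation.y,
                                      p.z + mesh.translation.z});
        }
        for (std::uint32_t index : mesh.indices) {
            assert(index < mesh.positions.size());
            // Shift local indices by the vertices that precede this mesh.
            batch.indices.push_back(static_cast<std::uint32_t>(firstVertex) + index);
        }
        batch.entries.push_back({firstVertex, firstIndex,
                                 mesh.positions.size(), mesh.indices.size()});
    }
    return batch;
}
```

After `buildBatch`, the two arrays can be uploaded once and drawn with a single indexed draw call; the `entries` list records which vertex range to rewrite if one mesh later moves.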
GPU Gems 2 has a good article about geometry instancing http://www.amazon.com/GPU-Gems-Programming-High-Performance-General-Purpose/dp/0321335597

2D Sprite animation techniques with OpenGL

I'm currently trying to setup a 2D sprite animation with OpenGL 4.
For example, I've designed a ball smoothly rotating with Gimp. There are about 32 frames ( 8 frames on 4 rows).
I aim to create a sprite atlas within a 2D texture and store my sprite data in buffers (VBO). My sprite rectangle would be always the same ( i.e. rect(0,0,32,32) ) but my texture coordinates will change each time the frame index is incremented.
I wonder how to modify the coordinates.
As the sprite tiles are stored on several rows, it appears to be difficult to manage in the shader.
Modify the sprite texture coordinate within the buffer using glBufferSubData()?
I spent a lot of time with OpenGL 1.x, and when I got back to OpenGL a few months ago I realized that many things had changed. I will try several options, but your suggestions and experience are welcome.
As the sprite tiles are stored on several rows, it appears to be
difficult to manage in the shader.
Not really, all your sprites are the same size, so you get a perfect uniform grid, and going from some 1D index to 2D is just a matter of division and modulo. Not really hard.
However, why do you even store the single frames in an m x n grid? You could just store them in one row. Better still, in modern GL we have array textures. These are basically a set of independent 2D layers, all of the same size. You access them with a 3D coordinate, the third coordinate being the layer, from 0 to n-1. This is ideally suited to your use case: it eliminates any issues with texture filtering or bleeding at the tile borders, and it also works well with mipmapping (if you need that). When array textures were introduced, the minimum number of layers an implementation was required to support was 64 (it is much higher nowadays), so 32 frames will be a piece of cake even for old GPUs.
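To make the "division and modulo" remark concrete, here is a minimal sketch of the index-to-grid mapping, written in C++ so it can be checked on the CPU; the same arithmetic carries over to a shader. `UvRect` and `frameToUv` are hypothetical names, and the sketch assumes frame 0 sits at texture coordinate (0, 0), which depends on how the image was loaded.

```cpp
#include <cassert>

// Texture-coordinate rectangle of one sprite frame inside the atlas.
struct UvRect {
    float u0, v0;  // corner with the smaller coordinates
    float u1, v1;  // opposite corner
};

// Maps a 1D frame index to its cell in a columns x rows atlas of
// equally sized tiles.
UvRect frameToUv(int frame, int columns, int rows) {
    assert(columns > 0 && rows > 0);
    assert(frame >= 0 && frame < columns * rows);
    const int col = frame % columns;  // position within the row
    const int row = frame / columns;  // which row
    const float w = 1.0f / static_cast<float>(columns);
    const float h = 1.0f / static_cast<float>(rows);
    return {col * w, row * h, (col + 1) * w, (row + 1) * h};
}
```

With the 8x4 atlas from the question, frame 9 lands in column 1 of row 1. With an array texture none of this is needed: the frame index is used directly as the layer coordinate.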
You could do this a million ways but I'm going to propose a naive solution:
Create a VBO with 32 (frame squares) * 2 (triangles per frame square) * 3 (triangle vertices) * 5 (x, y, z, u, v per vertex) = 960 floats of space. Fill it with the vertices of all your sprites, two triangles per frame.
Now according to the docs of glDrawArrays, you can specify where you start and how long you render for. Using this you can specify the following:
int verticesPerFrame = 2 * 3; // glDrawArrays counts vertices, not floats
int firstVertex = verticesPerFrame * currentBallFrame;
glDrawArrays(GL_TRIANGLES, firstVertex, verticesPerFrame);
No need to modify the VBO. Now from my point of view, this is overkill to just render 32 frames 1 frame at a time. There are better solutions to this problem but this is the simplest for learning OpenGL4.
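A minimal sketch of how such a 960-float array could be filled, under stated assumptions: the atlas is 8 columns by 4 rows as in the question, every quad covers the same rectangle from (0, 0) to (size, size), and the vertex layout is x, y, z, u, v. `buildSpriteVertices` and `firstVertexOfFrame` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kFloatsPerVertex = 5;   // x, y, z, u, v
constexpr int kVerticesPerFrame = 6;  // two triangles per frame

// Builds one quad (two triangles) per atlas frame. Positions are identical
// for every frame; only the texture coordinates differ.
std::vector<float> buildSpriteVertices(int columns, int rows, float size) {
    assert(columns > 0 && rows > 0);
    std::vector<float> data;
    data.reserve(static_cast<std::size_t>(columns) * rows *
                 kVerticesPerFrame * kFloatsPerVertex);
    const float du = 1.0f / static_cast<float>(columns);
    const float dv = 1.0f / static_cast<float>(rows);
    for (int frame = 0; frame < columns * rows; ++frame) {
        const float u0 = (frame % columns) * du;
        const float v0 = (frame / columns) * dv;
        const float u1 = u0 + du;
        const float v1 = v0 + dv;
        const float quad[kVerticesPerFrame][kFloatsPerVertex] = {
            {0.0f, 0.0f, 0.0f, u0, v0}, {size, 0.0f, 0.0f, u1, v0}, {size, size, 0.0f, u1, v1},
            {0.0f, 0.0f, 0.0f, u0, v0}, {size, size, 0.0f, u1, v1}, {0.0f, size, 0.0f, u0, v1},
        };
        for (const auto& vertex : quad) {
            data.insert(data.end(), vertex, vertex + kFloatsPerVertex);
        }
    }
    return data;
}

// First vertex to pass to glDrawArrays for a given frame.
int firstVertexOfFrame(int frame) {
    return frame * kVerticesPerFrame;
}
```

The array is uploaded once; each frame is then drawn by passing `firstVertexOfFrame(frame)` and `kVerticesPerFrame` as the start and count arguments of glDrawArrays.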
In OpenGL 2.1, I'm using your 2nd option:
void setActiveRegion(int regionIndex)
{
UVs.clear();
int numberOfRegions = (int) textureSize / spriteWidth;
// Divide as floats; integer division would always yield 0 here.
float uv_x = (regionIndex % numberOfRegions) / (float)numberOfRegions;
float uv_y = (regionIndex / numberOfRegions) / (float)numberOfRegions;
glm::vec2 uv_up_left = glm::vec2( uv_x , uv_y );
glm::vec2 uv_up_right = glm::vec2( uv_x+1.0f/numberOfRegions, uv_y );
glm::vec2 uv_down_right = glm::vec2( uv_x+1.0f/numberOfRegions, (uv_y + 1.0f/numberOfRegions) );
glm::vec2 uv_down_left = glm::vec2( uv_x , (uv_y + 1.0f/numberOfRegions) );
UVs.push_back(uv_up_left );
UVs.push_back(uv_down_left );
UVs.push_back(uv_up_right );
UVs.push_back(uv_down_right);
UVs.push_back(uv_up_right);
UVs.push_back(uv_down_left);
glBindBuffer(GL_ARRAY_BUFFER, uvBuffer);
glBufferSubData(GL_ARRAY_BUFFER, 0, UVs.size() * sizeof(glm::vec2), &UVs[0]);
glBindBuffer(GL_ARRAY_BUFFER, 0);
}
Source: http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-11-2d-text/
He implemented it to render 2D Text but it's the same concept!
I hope this helps!

Generating Smooth Normals from active Vertex Array

I'm attempting to hack and modify several rendering features of an old OpenGL fixed-pipeline game by hooking into OpenGL calls, and my current mission is to implement shader lighting. I've already created an appropriate shader program that lights most of my objects correctly, but this game's terrain is drawn with no normal data provided.
The game calls:
void glVertexPointer(GLint size, GLenum type, GLsizei stride, const GLvoid * pointer);
and
void glDrawElements(GLenum mode, GLsizei count, GLenum type, const GLvoid * indices);
to define and draw the terrain, so I have both of these functions hooked. I hope to loop through the given vertex array at the pointer and calculate normals for each surface, on either every glDrawElements call or every glVertexPointer call, but I'm having trouble coming up with an approach to do so; specifically, how to read, iterate over, and understand the data at the pointer. In this case, the usual parameters for the glVertexPointer calls are size = 3, type = GL_FLOAT, stride = 16, pointer = some pointer. When hooking glVertexPointer, I don't know how I could iterate through the pointer and grab all the vertices of the mesh, considering I don't know the total count of the vertices, nor do I understand how the data is structured at the pointer given the stride, and similarly how I should structure the normal array.
Would it be a better idea to calculate the normals in glDrawElements for each specified index in the index array?
Depending on how your vertex array is built, the indices may be the only information you need to build the normals.
Defining an averaged normal for a vertex is simple if you add a normal field to your vertex array and accumulate the normal calculations while parsing your index array.
You then have to divide each normal sum by the number of times the vertex appears in the indices. You can keep that count in a temporary array indexed by vertex, incremented each time a normal is added to the vertex.
To be more clear:
Vertex[vertexCount]: {Pos,Normal}
normalCount[vertexCount]: int count
Indices[indicesCount]: int vertexIndex
You may have 6 normals per vertex, so add a temporary array of normals to average for each vertex:
NormalTemp[vertexCount][6] {x,y,z}
Then parse your index array (if it is made of triangles):
for i=0 to indicesCount step 3
    for each triangle corner (t from 0 to 2)
        NormalTemp[indices[i+t]][normalCount[indices[i+t]]] = normal calculated as the cross product of the vectors from this corner to the other two vertices of the triangle
        normalCount[indices[i+t]]++
Then you have to divide your sums by the count (and normalize the result if you need unit length):
for i=0 to vertexCount step 1
    sum = 0
    for j=0 to normalCount[i] step 1
        sum += NormalTemp[i][j]
    normal[i] = sum / normalCount[i]
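The pseudocode above can be sketched in C++ as follows. This is an illustration under assumptions, not the game's actual data layout: positions are 3 floats at the start of each vertex (size = 3, GL_FLOAT), consecutive vertices are `strideBytes` apart (16 in the question, so 4 bytes per vertex are skipped), indices are 32-bit (the game may well use GL_UNSIGNED_SHORT), and the vertex count is inferred from the largest index because glVertexPointer does not pass it. All function names are hypothetical. Instead of storing up to 6 normals per vertex, the sketch accumulates them into a running sum and normalizes at the end, which gives the same direction as the average.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 normalized(Vec3 v) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return len > 0.0f ? Vec3{v.x / len, v.y / len, v.z / len} : v;
}

// Reads vertex i from a glVertexPointer-style array (size = 3, GL_FLOAT).
// As in OpenGL, a stride of 0 means the positions are tightly packed.
Vec3 readPosition(const void* pointer, std::size_t strideBytes, std::size_t i) {
    if (strideBytes == 0) strideBytes = 3 * sizeof(float);
    const unsigned char* base = static_cast<const unsigned char*>(pointer);
    Vec3 p;
    std::memcpy(&p, base + i * strideBytes, sizeof(Vec3));  // first 12 bytes of the vertex
    return p;
}

// Computes one averaged (smooth) unit normal per vertex of a triangle list.
std::vector<Vec3> smoothNormals(const void* pointer, std::size_t strideBytes,
                                const std::uint32_t* indices, std::size_t indexCount) {
    assert(indexCount % 3 == 0);
    // Infer the vertex count from the largest index that is referenced.
    std::size_t vertexCount = 0;
    for (std::size_t i = 0; i < indexCount; ++i) {
        const std::size_t needed = static_cast<std::size_t>(indices[i]) + 1;
        if (needed > vertexCount) vertexCount = needed;
    }
    std::vector<Vec3> normals(vertexCount, Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < indexCount; i += 3) {
        const Vec3 a = readPosition(pointer, strideBytes, indices[i]);
        const Vec3 b = readPosition(pointer, strideBytes, indices[i + 1]);
        const Vec3 c = readPosition(pointer, strideBytes, indices[i + 2]);
        const Vec3 faceNormal = normalized(cross(sub(b, a), sub(c, a)));
        for (std::size_t t = 0; t < 3; ++t) {
            Vec3& sum = normals[indices[i + t]];  // accumulate per vertex
            sum.x += faceNormal.x;
            sum.y += faceNormal.y;
            sum.z += faceNormal.z;
        }
    }
    for (Vec3& n : normals) n = normalized(n);
    return normals;
}
```

The resulting array has one normal per vertex in the same order as the positions, so it could be handed to glNormalPointer with a stride of 0 from inside the glDrawElements hook.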
While I like and have voted up j-p's answer, I would still like to point out that you could get away with calculating one normal per face and just using it for all 3 vertices. It would be faster and easier, and sometimes even more accurate.