glUniform vs. single draw call performance - opengl

Suppose I have many meshes I'd like to render. I have two choices:
Bake transforms and colors for each mesh into a VBO and render with a single draw call.
Use glUniform for transforms and colors and use many draw calls (but still a single VBO)
Assuming the scene changes very little between frames, which method tends to be better?

There are more than those two choices. At least one more comes to mind:
Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is generally faster than setting uniforms. The efficiency of setting uniforms is highly platform dependent, but they're generally not intended to be modified very frequently. They are called uniform for a reason. :)
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.

Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (assuming relevant for opengl-es tag). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If putting all the information which would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay (in perhaps costly memory reads) of reading redundant information (because it doesn't really vary per vertex) might negate the savings of using fewer draw calls.
As always the best (but tiresome) advice is to test (I know this is particularly hard developing for mobile where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare different options.

Related

Multiple buffers vs single buffer?

I'm drawing simple 3D shapes, and I was wondering: in the long run, is it better to use only 1 buffer to store all the data of your vertices?
Right now I have arrays of vertex data (positions and colors, per vertex) and I am pushing them to their own separate buffers.
But if I use stride and offset, I could join them into one array but that would become messier and harder to manage.
What is the "traditional" way of doing this?
It feels much cleaner and organized to have separate buffers for each piece of data, but I would imagine it's less efficient.
Is the efficiency increase worth putting it all into a single buffer?
The answer to this is highly usage-dependent.
If all your vertex attributes are highly volatile or highly static, you would probably benefit from interleaving and keeping them all together, as mentioned in the comments.
However, separating the data can yield better performance if one attribute is far more volatile than others. For example, if you have a mesh where you're often changing the vertex positions, but never the texture coordinates, you might benefit from keeping them separate: you would only need to re-upload the positions to the video card, instead of the whole set of attributes. An example of this might be a CPU-driven cloth simulation.
It is also hardware and implementation dependent. Interleaving isn't helpful everywhere, but I've never heard of it having a negative impact. If you can use it, you probably should.
However, since you can't properly interleave if you split the attributes, you're essentially comparing the performance impacts of two unknowns: will interleaving help on your target hardware/drivers, and will your data benefit from being split? There's no real answer to the first; the second is between you and your data.
Personally, I would suggest just using interleaved single blocks of vertex attributes unless you have a highly specialized need. It cuts the complexity, as opposed to needing to have potentially different systems mixed together in the same back end.
On the other hand, setting up interleaving is a rather complex task as far as memory addressing goes in C++. If you're not designing an entire graphics engine from scratch, I really doubt it's worth the effort for you. But again, that's up to you and your application.
In theory, though, merely grouping together the data you were going to upload to the video card regardless should have little impact. It might be slightly more efficient to group all the attributes together due to reducing the number of calls, but that's again going to be highly driver-dependent.
Unfortunately, I think the simple answer to your question boils down to: "it depends" and "no one really knows".

Should I sort by buffer use when rendering?

I'm designing the sorting part of my rendering engine. I know that changing the render target, shader program, texture bindings, and more are expensive and therefore one should sort the draw order based on them to reduce state changes. However, what about sorting based on what index buffer is bound, and which vertex buffers are used for attributes?
I'm confused about these because VAOs are mandatory and they encapsulate all of that state. So should I peek behind the scenes of vertex array objects (VAOs), see what state they set and sort based on it? Or should I just not care in what order VAOs are called?
This is what confuses me so much about vertex array objects. It makes sense to me to not be switching which buffers are in use over and over and yet VAOs just seem to force one to not care about that.
Is there a general vague or not agreed on order on which to sort stuff for rendering/game engines?
I know that binding a buffer simply changes some global state but surely it must be beneficial to the hardware to draw from the same buffer multiple times, maybe some small cache coherency?
While VAOs are mandatory in GL 3.1 (without GL_ARB_compatibility) and in the 3.2+ core profile, you do not have to use them the way they are intended; that is to say, you can bind a single VAO for the duration of your application's lifetime and continue to bind and unbind VBOs, etc. the traditional way if this somehow makes your life easier. Valve famously advocated doing this in their presentation on porting the Source engine from D3D to GL, though I tend to disagree with them on some points. A lot of things they mention in that presentation make me cringe as someone who has years of experience with both D3D and OpenGL; they are making suggestions on how to port something to an API they have only a minimal working knowledge of.
Getting back to your performance concern though, there can be validation overhead for changing bound resources frequently, so it is actually more than just "simply changing a global state." All GL commands have to do validation in order to determine if they need to set an error state. They will validate your input parameters (which is pretty trivial), as well as the state of any resource the command needs to use (this can be complicated).
Other types of GL objects like FBOs, textures and GLSL programs have more rigorous validation and more complicated memory dependencies than buffer objects and vertex arrays do. Swapping a vertex pointer should be cheaper in the grand scheme of things than most other kinds of object bindings, especially since a lot of stuff can be deferred by an implementation until you actually issue a glDrawElements (...) command.
Nevertheless, the best way to tackle this problem is just to increase reuse of vertex buffers. Object reuse is pretty high to begin with for vertex buffers: if you have 200 instances of the same opaque model in a scene, you can potentially draw all 200 of them back-to-back and never have to change a vertex pointer. Materials tend to change far more frequently than actual vertex buffers, so you would generally sort your drawing first and foremost by material (sub-sorted by associated states like opaque/translucent, texture(s), shader(s), etc.). You can add another level to batch sorting to draw all batches that share the same vertex data after they have been sorted by material. The ultimate goal is usually to minimize the number of draw commands necessary to complete your frame, and priority/hierarchy-based sorting with emphasis on material often delivers the best results.
Furthermore, if you can fit multiple LODs of your model into a single vertex buffer, instead of swapping between different vertex buffers sometimes you can just draw different sets of indices or even just a different range of indices from a single index buffer. In a very similar way, texture swapping pressure can be alleviated by using packed texture atlases / sprite sheets instead of a single texture object for each texture.
You can definitely squeeze out some performance by reducing the number of changes to vertex array state, but the takeaway message here is that vertex array state is pretty cheap compared to a lot of other states that change frequently. If you can quickly implement a secondary sort to reduce vertex state changes then go for it, but I would not invest a lot of time in anything more sophisticated unless you know it is a bottleneck. Prioritize texture, shader and framebuffer state first as a general rule.

Which is the most optimal and correct way to draw many different dynamic 3D models (animated, changing every frame)?

I need to know how I can render many different 3D models, which change their geometry every frame (they are animated models), without repeating models and textures.
I load all the models, and for each one I create an "object" model class.
What is the most optimal way to render them?
To use 1 VBO for each 3D model
To use a single VBO for all models (since they are all different, I don't see this option as feasible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL;DR: there's no silver bullet when it comes to rendering performance.
Why is that? It depends on the complicated process that takes your data, converts it, pushes it to the GPU, and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that usually improve performance:
Keep all the necessary data on the GPU (because the closer the data is to the screen, the shorter the way the electrons have to go :))
Send as little data to the GPU between frames as possible
Don't sync needlessly between the CPU and GPU (that's like trying to run two high-speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while).
Now, it's obvious that if you want a model that will change, you can't have your cake and eat it too. You have to make tradeoffs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Hint to the GPU about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW); that helps the driver choose a good memory arrangement.
Don't use interleaved buffers to mix static vertex attributes with dynamic ones - if you divide the memory, you can batch-update the geometry leaving texture coordinates intact, for example.
Try to do as much as you can purely on the GPU - with compute shaders and transform feedback, it might well be possible to store whole animation data as a buffer itself and calculate it on GPU, avoiding expensive syncs.
And last but not least, always carefully measure the impact of your change on performance. Going blindly won't help. Measure accurately and thoroughly (even stuff like shader compilation time might matter sometimes!). Then, even if you go by trial-and-error, there's a hope you'll get somewhere.
And to address one of your points in particular: whether it's one large VBO or a few smaller ones doesn't really matter much, but a huge one might have problems fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside of it.

Which is a larger performance drain: quantity of vertices in one draw call, or quantity of calls?

I am quickly finding that one of the organisational considerations you must make when preparing rendering in OpenGL is the type of topology and the arrangement of vertices.
Now there are some interesting methods out there for organising vertices into very long arrays, with nice uses of interleaved arrays, indexes, etc, so that you can pour a lot of geometry into one OpenGL call.
But it's much easier in some cases to simply iterate and perform multiple calls with smaller vertex arrays.
While I agree with the notion that premature optimization is somewhat wasteful, just how important of a consideration should it be to minimize OpenGL calls, especially if multiple calls would actually involve far fewer vertexes per call?
I can see that this is one of those decisions that is important early in the development process, since it forms a lot of the structure of how vertexes get created and organized.
There is an overhead for each command you send down to the GPU. By batching vertices you minimize that overhead and also allow the driver to make small optimizations to your data before sending it to the hardware. It can make quite a difference, and it is the reason glBegin and glEnd were completely removed from newer iterations of OpenGL.
You should try to avoid making many driver states changes and many drawing calls.
EDIT: Consider using degenerate vertices in your triangle strips (this also helps minimize the number of vertices processed) so that you can use just one draw call and render all your topology (unless you need to change some driver state between parts of the topology).
You can find a balance for your specific needs. But the thing is that there are many variables in the equation, and there's no simple solution (like "always make the scene one big single batch!"). TraxNet gave you good advice though: always try to minimize API calls (whether draws or state changes). But it doesn't have to be just a few calls; on a modern PC it could be thousands per frame, while on a not-so-modern mobile phone it may be just a few hundred.
Also, TraxNet mentioned degenerate triangles (which help form strips). Though they're still triangles (and add to the "total" triangle count rendered), they cost almost nothing while helping to minimize the number of draw calls.

Organizing GLSL shaders in OpenGL engine

Which is better ?
To have one shader program with a lot of uniforms specifying
lights to use, or mappings to do (e.g. I need one mesh to be parallax mapped, and another one parallax/specular mapped). I'd keep a cached list of uniforms for lazy transfers, and just change a couple of uniforms for each successive mesh as needed.
To have a lot of shader programs for every needed case, each with a small number of uniforms, and do a lazy bind with glUseProgram for each mesh as needed. Here I assume that meshes are properly batched, to avoid redundant switches.
Most modern engines I know have a "shader cache" and use the second option, because apparently it's faster.
Also you can take a look at the ARB_shader_subroutine which allows dynamic linkage. But I think it's only available on DX11 class hardware.
Generally, option 2 will be faster/better unless you have a truly huge number of programs. You can also use buffer objects shared across programs so that you need not reset any values when you change programs.
In addition, once you link a program, you can free all of the shaders that you linked into the program. This will free up all the source code and any pre-link info the driver is keeping around, leaving just the fully-linked program in memory.
I would tend to believe that it depends on the specific application. Even if it could be more efficient to run, say, 100 programs with about 2-16 uniforms each, it may be better to find a trade-off between the two. I would think that having maybe 10-20 programs for your most common shading techniques would be sufficient, or a few more. For example, you might want to have one program/shader to do all your bump mapping, one to do all of your fog effects, one to do reflections, and one to do refractions.
Now, outside the scope of your question (though I think it pertains here as well), one thing to incorporate into your engine would be a BatchProcess & BatchManager class setup to reduce the number of CPU-to-GPU calls over the bus, as this would prove efficient as well. I don't think there is a one-size-fits-all solution to your question; I believe it is application specific, just like setting up the relationship between how many batches (buckets) of vertices (primitives) your engine has and how many vertices each of those batches contains.
To make this a bit clearer: one game might have 4 containers or batches, where each batch can hold up to 10,000 vertices before it is considered full and the BatchManager empties that bucket, sending all of those vertices to the graphics card for the rendering pipeline to process and draw; a different game may have 10 buckets with 5,000 vertices each, and another might have 8 buckets with 12,000 vertices.
So there could be a trade-off combining the two according to your needs. If you have 1 single program with 100s of uniforms, the single program is easier to manage within the pipeline, but the shaders would be cumbersome to read and manage. Then again, shaders with very few uniforms are quite easy to read and manage, but having 100s of programs is a little harder to manage on the CPU before linking and sending them to be rendered properly. I would personally try to find a middle ground where I have enough programs to do each specific task that is completely unique from the others, such as doing fog density in one and volumetric shadow mapping in another, where each program has just enough uniforms to do the calculations required.
The next step would then be to do some benchmark testing to see where your efficiency and your overhead are balanced, and make the appropriate adjustments.